Self-Assembling Protein Nanostructures

ABSTRACT

Synthetic nanostructures, proteins that are useful, for example, in making synthetic nanostructures, and methods for designing such synthetic nanostructures are disclosed herein.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 14/759,308 filed Jul. 6, 2015, which is a national phase filing of PCT Application Serial Number PCT/US14/15371 filed Feb. 7, 2014, which claims priority to U.S. Provisional Patent Application No. 61/762,194 entitled “General Method for Designing Multi-Component Protein Materials” filed Feb. 7, 2013, each entirely incorporated by reference herein for all purposes.

BACKGROUND OF THE INVENTION

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Molecular self-assembly is an elegant and powerful approach to patterning matter on the atomic scale. Recent years have seen advances in the development of self-assembling biomaterials, particularly those composed of nucleic acids. DNA has been used to create, for example, nanoscale shapes and patterns, molecular containers, and three-dimensional macroscopic crystals. Methods for designing self-assembling proteins have progressed more slowly, yet the functional and physical properties of proteins make them attractive as building blocks for the development of advanced functional materials.

In any self-assembling structure, interactions between the subunits are required to drive assembly. Previous approaches to designing self-assembling proteins have satisfied this requirement in various ways, including the use of relatively simple and well-understood coiled-coil and helical bundle interactions, engineered disulfide bonds, chemical cross-links, metal-mediated interactions, templating by non-biological materials in conjunction with computational protein interface design, or genetic fusion of multiple protein domains or fragments which naturally self-associate.

In some scenarios, computational modeling and design of molecules can aid researchers in investigating the molecules. For example, computational protein design can provide valuable reagents for biomedical and biochemical research, identify sequences compatible with a given protein backbone, and design protein folds.

SUMMARY

In one aspect, isolated nanostructures are provided, comprising

(a) a plurality of first proteins that self-interact to form a first multimeric substructure comprising at least one axis of rotational symmetry;

(b) a plurality of second proteins that self-interact to form a second multimeric substructure comprising at least one axis of rotational symmetry;

wherein multiple copies of the first multimeric substructure and the second multimeric substructure interact with each other at one or more symmetrically repeated, non-natural, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group.

The mathematical symmetry group of the nanostructures of the invention may, for example, be selected from the group consisting of tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry. In one embodiment, the first multimeric substructure comprises a dimer, trimer, tetramer, or pentamer of the first protein, and the second multimeric substructure comprises a dimer or trimer of the second protein. In another embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a dimer of the second protein. In a further embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a trimer of the second protein. In another embodiment, the first protein and the second protein may be between 30 and 250 amino acids in length. In a still further embodiment, each symmetrically repeated instance of the non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure buries between 1000 and 2000 Å² of solvent-accessible surface area (SASA) on the first multimeric substructure and the second multimeric substructure. In another embodiment, each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure has a shape complementarity value between 0.5 and 0.8. In a further embodiment, at least 50% of the atomic contacts comprising each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure are formed from amino acid residues residing in elements of alpha helix and/or beta strand secondary structure. Exemplary first and second proteins are disclosed herein.

In another aspect, the invention provides isolated proteins comprising an amino acid sequence selected from the group consisting of SEQ ID NOS: 1-40, multimeric assemblies comprising a plurality of identical isolated protein monomers, recombinant nucleic acids encoding the isolated proteins, recombinant expression vectors comprising the recombinant nucleic acids operatively linked to a promoter, and recombinant host cells comprising the recombinant expression vectors of the invention, as well as kits comprising one or more of the compositions of the invention.

In a further aspect, a method is provided. A computing device generates a plurality of representations of a first protein building block. The computing device generates a plurality of representations of a second protein building block, where the first protein building block differs from the second protein building block. The computing device generates an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group. The computing device computationally determines a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design. The computing device computationally modifies amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces. The plurality of representations of protein-protein interfaces include one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration. The computing device generates an output that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of an example method.

FIG. 2 depicts example protein architectures.

FIGS. 3A-3F show a method for building protein architectures.

FIGS. 4A, 4B, and 4C show three different symmetric fold tree representations using an example two-component architecture with D3 symmetry.

FIG. 5A shows SEC chromatograms of designed pairs of proteins and wild-type oligomeric proteins.

FIG. 5B shows a native PAGE analysis of in vitro-assembled T32-28 and T33-15 in cell lysates.

FIGS. 5C-5G show respective native PAGE analyses of in vitro-assembled T32-28, T33-09, T33-15, T33-21, and T33-28 in cell lysates.

FIGS. 6A and 6B each show electron micrographs of designed two-component protein nanomaterials.

FIG. 7 shows computational design models and crystal structures of designed two-component protein nanomaterials.

FIG. 8 is a block diagram of an example computing network.

FIG. 9A is a block diagram of an example computing device.

FIG. 9B depicts an example cloud-based server system.

FIGS. 10A-10D show the amino acid sequence of an exemplary protein (T32-28A) of the invention (SEQ ID NOs: 1, 11, 21, 31).

FIGS. 11A-11D show the amino acid sequence of an exemplary protein (T32-28B) of the invention (SEQ ID NOs: 2, 12, 22, 32).

FIGS. 12A-12B show the amino acid sequence of an exemplary protein (T33-09A) of the invention (SEQ ID NOs: 3, 13, 23, 33).

FIGS. 13A-13C show the amino acid sequence of an exemplary protein (T33-09B) of the invention (SEQ ID NOs: 4, 14, 24, 34).

FIGS. 14A-14D show the amino acid sequence of an exemplary protein (T33-15A) of the invention (SEQ ID NOs: 5, 15, 25, 35).

FIGS. 15A-15C show the amino acid sequence of an exemplary protein (T33-15B) of the invention (SEQ ID NOs: 6, 16, 26, 36).

FIGS. 16A-16D show the amino acid sequence of an exemplary protein (T33-21A) of the invention (SEQ ID NOs: 7, 17, 27, 37).

FIGS. 17A-17C show the amino acid sequence of an exemplary protein (T33-21B) of the invention (SEQ ID NOs: 8, 18, 28, 38).

FIGS. 18A-18C show the amino acid sequence of an exemplary protein (T33-28A) of the invention (SEQ ID NOs: 9, 19, 29, 39).

FIGS. 19A-19C show the amino acid sequence of an exemplary protein (T33-28B) of the invention (SEQ ID NOs: 10, 20, 30, 40).

DETAILED DESCRIPTION

Natural protein assemblies are most often held together by many weak, noncovalent interactions which together form large, highly complementary, low energy protein-protein interfaces. Such interfaces spontaneously self-assemble and allow precise definition of the orientation of subunits relative to one another, which is critical for obtaining the desired material with high accuracy. Designing assemblies with these properties has been difficult due to the complexities of modeling protein structures and energetics.

A general computational method for designing self-assembling protein materials is disclosed, involving symmetrical docking of protein building blocks in a target symmetric architecture.

In some embodiments of the general computational method, the protein building blocks can include two or more distinct protein building blocks. Then, classes of nanomaterials can be constructed from docked configurations of the two or more distinct protein building blocks. Using multiple distinct protein building blocks can provide greater control over the assembly process and new functions. The nanomaterials can be engineered to encapsulate biomolecules of interest and deliver them to the cytosol of cultured cells to demonstrate their potential as next-generation targeted delivery vehicles.

The methods described herein can be used to design nanomaterials that combine several features of fundamental importance for their use in therapeutic applications. The nanomaterials can be designed with atomic-level accuracy that 1) underlies protein structure-function relationships, 2) is critical for the design of function, and 3) is currently inaccessible to other classes of materials such as synthetic nanoparticles or liposomes. Modular materials can be derived from these methods that enable the facile development of a variety of sophisticated functionalities. The nanomaterials can be “smart” materials that can respond in vitro or in vivo to therapeutically relevant environmental cues such as changes in pH.

Multi-component materials can enable design of larger cage-like assemblies with greater internal loading capacities, control over the initiation of assembly through mixing of separately purified components, and independent functionalization of each component. These three features are important for many potential downstream applications, including targeted delivery, vaccine design, and biosynthetic pathway engineering.

Software can simultaneously model multiple distinct subunit types in all of the symmetry groups relevant to protein structure, including helical, point group, layer group, and space group symmetries. The software can contain functionality for designing symmetric nanostructures, efficiently calculating scores, and sampling symmetric degrees of freedom.

Example Operations

FIG. 1 is a flow chart of an example method 100. Method 100 can begin at block 110, where a computing device, such as computing device 1000 described below in the context of at least FIG. 9A, can generate a plurality of representations of a first protein building block. At block 120, the computing device can generate a plurality of representations of a second protein building block, where the first protein building block differs from the second protein building block. In some embodiments, each of the first and second protein building blocks can include a synthetic polypeptide.

At block 130, the computing device can generate an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group. In some embodiments, each of the plurality of the first and second protein building blocks can include a protein that shares an axis of symmetry with the designated mathematical symmetry group. In other embodiments, the designated mathematical symmetry group can conform to a symmetry selected from tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry. In still other embodiments, generating the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include computationally aligning symmetry axes of the first protein building block and the second protein building block with at least one axis in the designated mathematical symmetry group.

At block 140, the computing device can computationally determine a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design.

In some embodiments, determining a docked configuration of the plurality of the first and second protein building blocks can additionally include sampling rotational degrees of freedom and translational degrees of freedom for each of the first and second protein building blocks. In particular of these embodiments, sampling the rotational degrees of freedom and the translational degrees of freedom can include: selecting a rotational value for a rotational degree of freedom for each of the first and second protein building blocks; selecting a translational value for a translational degree of freedom for each of the first and second protein building blocks; determining a sampled representation of the first protein building block based on the selected rotational value for the first protein building block and the selected translational value for the first protein building block; determining a sampled representation of the second protein building block based on the selected rotational value for the second protein building block and the selected translational value for the second protein building block; and determining a designability measure for the docked configuration using the sampled representation of the first protein building block and the sampled representation of the second protein building block.

In more particular of these embodiments, determining the designability measure of the docked configuration can include determining a number of beta carbon contacts within a specified distance threshold between the sampled representation of the first protein building block and the sampled representation of the second protein building block in the docked configuration based on the values of the selected rotational and translational degrees of freedom.
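As a non-limiting illustration of the sampling and designability steps above, the following Python sketch places two building blocks using one rotational value and one translational value each, then counts beta carbon contacts between them. Several simplifications are assumptions made only for this sketch: both blocks use the z axis as their symmetry axis, symmetry-related copies are omitted, and the function names and the 10 Å default cutoff are hypothetical (the text specifies only "a specified distance threshold").

```python
import math

def rotate_about_z(point, angle_deg):
    """Rotate a 3D point about the z axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def place_block(cb_coords, angle_deg, shift):
    """Apply one rotational value and one translational value along z.

    cb_coords is a list of (x, y, z) beta carbon positions in angstroms.
    """
    placed = []
    for p in cb_coords:
        x, y, z = rotate_about_z(p, angle_deg)
        placed.append((x, y, z + shift))
    return placed

def count_cb_contacts(block_a, block_b, cutoff=10.0):
    """Count beta carbon pairs between two blocks within cutoff angstroms.

    The cutoff default is an assumed value, not one taken from the text.
    """
    return sum(1 for p in block_a for q in block_b
               if math.dist(p, q) <= cutoff)
```

A larger contact count for a sampled pair of rotational and translational values would suggest a configuration with more interface area available for subsequent sequence design.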

At block 150, the computing device can computationally modify amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces. The plurality of representations of protein-protein interfaces can include one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration using the computing device.

In some embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include selecting a selected representation of one or more amino acid sequences associated with a representation of at least one protein building block of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block. In particular of these embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include computationally mutating an amino acid sequence of the selected representation of one or more amino acid sequences. In other embodiments, computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block can include evaluating an energy of an amino acid mutation using a computational score function.
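As a non-limiting illustration, the mutate-and-score loop described above can be sketched in Python as a greedy scan over designable positions. This is a minimal sketch under stated assumptions, not the design algorithm of any particular software package: `greedy_design` and `toy_score` are hypothetical names, the score function is supplied by the caller, and `toy_score` is a placeholder that merely rewards hydrophobic residues so the example can run.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def greedy_design(sequence, positions, score):
    """Try every residue at each designable position.

    A mutation is kept only if the caller-supplied score function
    (lower is better) reports a strictly lower value.
    """
    seq = list(sequence)
    best = score("".join(seq))
    for pos in positions:
        for aa in AMINO_ACIDS:
            trial = seq[:]
            trial[pos] = aa
            s = score("".join(trial))
            if s < best:
                seq, best = trial, s
    return "".join(seq), best

def toy_score(seq):
    """Placeholder 'energy' (an assumption): lower with more hydrophobic residues."""
    return -sum(1 for aa in seq if aa in "AVILMFW")
```

For instance, `greedy_design("KKEK", [1, 2], toy_score)` mutates only positions 1 and 2 and leaves the remaining positions untouched. A practical score function would evaluate the modeled interface energetics rather than sequence composition alone.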

At block 160, the computing device can generate an output of the computing device that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.

FIG. 2 depicts example protein architectures that can be designed using the method, in accordance with an example embodiment. FIG. 2 shows ten example architectures arranged roughly into two columns, with architectures labeled O3, O42, I32, T33, and I52 in the left column, and architectures labeled O32, O43, T32, I53, and T3 in the right column. The right column of architectures also includes a reference line indicating a reference distance of 15 nanometers within FIG. 2.

An architecture is labeled in FIG. 2 as either Xy or Xyz, where X is a letter, and y and z are numbers. The letter X represents the symmetry of the architecture: T for tetrahedral symmetry, O for octahedral symmetry, and I for icosahedral symmetry. The number y indicates a number of monomers in a first building block used to build the architecture and the number z indicates a number of monomers in a second building block used to build the architecture. If the number z is not present, then only one type of building block is used to build the structure. For example, the O3 architecture at the top of the left column of FIG. 2 is an octahedral structure made up of one trimeric building block, the T33 architecture toward the bottom of the left column of FIG. 2 is a tetrahedral structure made up of two different types of trimeric building blocks, and the I52 architecture at the bottom of the left column of FIG. 2 is an icosahedral structure made up of two multimeric building blocks: a pentamer and a dimer.
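As a non-limiting illustration, the Xy / Xyz labeling convention can be decoded with a short Python sketch. The function name `parse_architecture` and the returned tuple layout are assumptions made for illustration; only the T, O, and I letters described above are handled.

```python
SYMMETRY_NAMES = {"T": "tetrahedral", "O": "octahedral", "I": "icosahedral"}

def parse_architecture(label):
    """Split a label such as 'T33' into (symmetry name, oligomer sizes).

    The first character names the symmetry; each following digit gives
    the number of monomers in one building block.
    """
    letter, digits = label[:1], label[1:]
    if (letter not in SYMMETRY_NAMES
            or not digits.isdigit()
            or len(digits) not in (1, 2)):
        raise ValueError(f"unrecognized architecture label: {label!r}")
    return SYMMETRY_NAMES[letter], tuple(int(d) for d in digits)
```

Under this sketch, a one-digit label yields a single building block and a two-digit label yields two building blocks, in the order y then z.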

FIG. 2 also indicates how many of each type of building block are utilized to build the structure. In the T32 architecture shown in the middle of the right column of FIG. 2 and in FIG. 3F, the structure is assembled from 4 trimers aligned along the tetrahedral three-fold symmetry axes and 6 dimers aligned along the two-fold symmetry axes. The T33 architecture (also shown in FIGS. 3A-3E) is constructed from four copies of one trimer and four copies of a second trimer, with the three-fold symmetry axis of each trimer aligned at opposite poles of each tetrahedral three-fold symmetry axis.
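As a non-limiting illustration, the building block counts above can be reproduced with a short calculation. The sketch assumes that each n-meric building block sits on a matching n-fold axis, so that the number of copies equals the order of the rotation group (12 for tetrahedral, 24 for octahedral, 60 for icosahedral) divided by n. The function names are hypothetical.

```python
# Orders of the rotational point groups (number of rotation operations).
ROTATION_GROUP_ORDER = {"T": 12, "O": 24, "I": 60}

def building_block_counts(label):
    """Return [(oligomer size, number of copies), ...] for a label like 'T32'.

    Assumes each n-mer lies on an n-fold axis, giving order / n copies.
    """
    order = ROTATION_GROUP_ORDER[label[0]]
    return [(int(d), order // int(d)) for d in label[1:]]

def total_subunits(label):
    """Total number of protein subunits in the assembled architecture."""
    return sum(n * copies for n, copies in building_block_counts(label))
```

For T32 this gives 4 trimers and 6 dimers, and for T33 it gives 4 copies of each trimer; both architectures therefore contain 24 subunits in total.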

Accurate Design of Coassembling Multi-Component Protein Nanomaterials

The self-assembly of proteins into highly ordered nanoscale architectures is a hallmark of biological systems. Compared to homooligomers, assemblies formed from multiple distinct components offer a wider range of possible structures due to their combinatorial nature, greater control over the timing of assembly, and enhanced modularity through independently addressable building blocks. Disclosed is a general computational method for designing protein nanomaterials in which two distinct types of subunits coassemble to a target symmetric architecture. The information necessary to direct assembly is encoded in designed protein-protein interfaces that precisely define the relative orientations of the building blocks. This method has been used to design five novel 24-subunit cage-like protein nanomaterials in two distinct symmetric architectures. The designed pairs of proteins self-assemble to form highly homogeneous nanocages when co-expressed in E. coli, and the assembly of two of the materials can be initiated upon demand by mixing independently produced components. Crystal structures of the materials are in close agreement with the computational design models at the level of both the designed interfaces and the overall architectures. The accuracy of the method and the universe of two-component materials that it makes accessible pave the way for the design of functional protein nanomaterials tailored to specific applications.

The level of structural complexity available to self-assembled nanomaterials generally increases with the number of unique molecular components used to construct the material. DNA nanotechnology provides an extreme example of this phenomenon: strategies have been developed for encoding specific and directional interactions between hundreds of distinct DNA strands, allowing the construction of nanoscale objects with essentially arbitrary structures. Here the structural and functional range of designed protein materials is expanded with a general computational method for designing two-component coassembling protein nanomaterials with high accuracy.

Software can be used to model multi-component systems; that is, systems consisting of multiple distinct protein subunits, each associated with a distinct symmetry group. Within the updated framework disclosed herein, the distinct subunits can be modified independently of one another, with the changes propagated to all symmetrically related copies.

FIGS. 3A-3F show a method for building two-component symmetric protein nanostructures. FIGS. 3A-3E illustrate this method using a dual tetrahedral architecture (designated in FIG. 2 and FIG. 3A as T33) as an example. In the T33 architecture, four copies each of two distinct, naturally trimeric building blocks are aligned at opposite poles of the three-fold symmetry axes of a tetrahedron as shown in FIG. 3A. This alignment places one set of building blocks at the vertices of the T33 tetrahedron and the second set of building blocks at the centers of the faces of the T33 tetrahedron, totaling twelve subunits of each protein.
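As a non-limiting illustration of the T33 geometry, the following Python sketch checks that each face center of a regular tetrahedron lies on the same three-fold axis as the opposite vertex, at the opposite pole. The vertex coordinates are one conventional choice for a tetrahedron centered at the origin, and the function names are hypothetical.

```python
# One conventional set of vertices for a regular tetrahedron at the origin.
VERTICES = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def face_center_opposite(index):
    """Centroid of the face formed by the three vertices other than `index`."""
    others = [v for i, v in enumerate(VERTICES) if i != index]
    return tuple(sum(c) / 3 for c in zip(*others))

def is_opposite_pole(vertex, center, tol=1e-9):
    """True if `center` is a negative multiple of `vertex`.

    A zero cross product means the two vectors are collinear (same axis);
    a negative dot product means they point to opposite poles.
    """
    vx, vy, vz = vertex
    cx, cy, cz = center
    cross = (vy * cz - vz * cy, vz * cx - vx * cz, vx * cy - vy * cx)
    dot = vx * cx + vy * cy + vz * cz
    return all(abs(c) < tol for c in cross) and dot < 0
```

With four vertices and four face centers, placing one trimer type at each vertex pole and the other trimer type at each face-center pole gives 4 x 3 = 12 subunits of each protein.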

Each trimeric building block is allowed to rotate around and translate along its three-fold symmetry axis as indicated in FIG. 3B; other rigid body moves are disallowed because they would lead to asymmetry. These four degrees of freedom can be systematically explored during docking to identify configurations with interfaces that are suitable for design, as shown in FIG. 3C. The docking score function can maximize the number of inter-building block neighbors per residue and can favor residues in highly anchored regions of the protein structure that are unlikely to change conformation upon mutation of surface side chains as shown in FIG. 3D. A design algorithm, such as but not limited to the RosettaDesign algorithm, can be used to sample the identities and configurations of the side chains near the inter-building block interface, generating interfaces with features resembling those found in natural protein assemblies such as well-packed hydrophobic cores surrounded by polar rims, such as shown in FIG. 3E. The end result is a pair of new amino acid sequences, one for each building block, predicted to stabilize the modeled interface and therefore spontaneously drive assembly to the specific target configuration. These docking and design procedures were implemented in software to enable the simultaneous modeling of multiple distinct symmetrically arranged protein components. In particular, different components can be moved independently of one another while maintaining their internal degrees of freedom. This enables the design strategy described above to be generalized to a wide variety of symmetric architectures in which multiple symmetric building blocks are combined in geometrically specific ways. Combining even two symmetry elements can give rise to a large number of distinct symmetric architectures with a range of possible morphologies, including those with dihedral and cubic point group symmetries, as well as helical, layer, and space group symmetries.
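As a non-limiting illustration, the systematic exploration of the four rigid body degrees of freedom (one rotation and one translation per building block) can be sketched in Python as an exhaustive grid search. The scoring callable is supplied by the caller and is hypothetical here; the function name `dock_grid` and the grid spacing are likewise assumptions. Because a trimer is unchanged by a 120 degree rotation about its three-fold axis, rotations need only be sampled over a 120 degree range.

```python
import itertools

def dock_grid(score, angles, shifts):
    """Enumerate all (angle_a, shift_a, angle_b, shift_b) combinations.

    `score` is a caller-supplied function of the four values; higher is
    treated as more designable. Returns (best score, best configuration).
    """
    best = None
    for config in itertools.product(angles, shifts, angles, shifts):
        s = score(*config)
        if best is None or s > best[0]:
            best = (s, config)
    return best

# Example grid: 0 to 119 degrees in 30 degree steps, and three shifts
# in angstroms. Both are illustrative values only.
EXAMPLE_ANGLES = range(0, 120, 30)
EXAMPLE_SHIFTS = [0.0, 2.0, 4.0]
```

In practice the grid would be much finer and the score would reflect interface quality, for example the number of inter-building block neighbors per residue as described above.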

As shown in the non-limiting examples that follow, the designed interfaces can drive assembly of cage-like nanomaterials that closely match the computational design models: the backbone RMSD over all 24 subunits in each material ranges from 1.0 to 2.6 Å. The precise control over interface geometry offered by our method thus enables the design of two-component protein nanomaterials with diverse nanoscale features such as surfaces, pores, and internal volumes with high accuracy.
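As a non-limiting illustration of the RMSD comparison, the following Python sketch computes the root-mean-square deviation between two equal-length lists of atomic coordinates. It assumes the two structures have already been superimposed and that atoms are listed in matching order; the superposition procedure itself is not described in this passage and is not implemented here. The function name is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes the structures are already superimposed and atoms are paired
    by position in the lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))
```

Applied to the backbone atoms of a design model and the corresponding crystal structure, a smaller value indicates closer agreement between the two.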

The method described here can provide a general route to designing multi-component protein-based nanomaterials and molecular machines with programmable structures and functions. The capability to design highly homogeneous protein nanostructures with atomic-level accuracy and controllable assembly can open new opportunities in targeted drug delivery, vaccine design, plasmonics, and other applications that can benefit from the precise patterning of matter on the sub-nanometer to hundred nanometer scale.

Multi-Component Symmetric Modeling

The herein-described methods and techniques are not limited to use of RosettaDesign, the Rosetta™ software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to model multi-component symmetric protein nanostructures. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein.

As an example embodiment, the Rosetta™ software package was modified for multi-component symmetric modeling. Rosetta's symmetric modeling framework was updated to enable modeling of multi-component systems; that is, systems consisting of multiple distinct protein subunits, each associated with a distinct symmetry group. Within this updated framework, the distinct subunits can be modified independently of one another, with the changes propagated to all symmetrically related copies. All of Rosetta's design and modeling functionality accessible to one-component symmetries is now accessible for multi-component symmetries as well, including efficient scoring calculations and sampling of symmetric degrees of freedom. These changes to Rosetta's symmetry machinery are illustrated in FIGS. 4A-4C and described briefly below. In both the one-component examples shown in FIGS. 4A and 4B and the multi-component example of FIG. 4C, the symmetry of a given target architecture is passed to Rosetta in the form of a symmetry definition file.

FIGS. 4A, 4B, and 4C show three different symmetric fold tree representations of an example D32 architecture. In each of FIGS. 4A, 4B, and 4C, the D32 architecture is made up of two trimeric building blocks, each shown in a relatively dark gray color, and three dimeric building blocks each shown in a relatively light gray color, arranged with D3 point group symmetry. Following the strategy described above, arranging the building blocks with D3 point group symmetry is accomplished by aligning the three-fold symmetry axes of the trimeric building blocks along the three-fold axis in D3 point group symmetry and the two-fold symmetry axes of the dimeric building blocks along the two-fold axes in D3 point group symmetry. In the examples shown in FIGS. 4A-4C, rigid body degrees of freedom (RB DOFs) are shown using gray lines. FIGS. 4A and 4B show examples with one component symmetry. FIG. 4A shows RB DOF J_(D3) connecting the master dimer subunit to the master trimer subunit. RB DOF J_(D3) is a child of RB DOFs J_(D1) and J_(D2) controlling the master dimer subunit; in this case the positions of the trimeric subunits depend on the positions of the dimeric subunits. That is, the RB DOFs of the trimeric building blocks shown in FIG. 4A depend on the RB DOFs of the dimeric building blocks.

In the example shown in FIG. 4B, RB DOF J_(T3) connecting the master trimer subunit to the master dimer subunit is a child of RB DOFs J_(T1) and J_(T2) controlling the master trimer subunit. Then, in FIG. 4B, the positions, and thus the RB DOFs, of the dimeric subunits depend on the positions, and thus the RB DOFs, of the trimeric subunits. FIG. 4C shows an example with multi-component symmetry. With multi-component symmetric modeling the RB DOFs controlling the master trimer subunit (J_(T1) and J_(T2)) are independent of the RB DOFs controlling the master dimer subunit (J_(D1) and J_(D2)); in the example of FIG. 4C, the positions of the dimeric subunits do not depend on the positions of the trimeric subunits and vice versa.

In some embodiments, only a single connection was allowed from the symmetric fold tree into the asymmetric unit. Thus, when modeling a system with multiple distinct symmetric components, only one such component could have its internal DOFs preserved. For example, in the D32 system shown in FIGS. 4A and 4B, if only one connection into the asymmetric unit is allowed, then one must choose to connect the two subunits in the asymmetric unit to either the three-fold axis (middle panel) or the two-fold axis (left panel). If both are connected to the three-fold axis, rotations around this connection will correctly preserve the internal DOFs of the trimer, but disrupt the internal DOFs of the dimer such as shown in FIG. 4B.

Other embodiments can enable multiple connections from the symmetric fold tree into the asymmetric unit, as the multi-component extension of symmetric modeling in Rosetta allows the asymmetric unit to be broken down into substructures that are independently managed by the symmetric fold tree. Using a multi-component symmetric fold tree in our D32 example allows the trimer to connect directly to the three-fold axis and the dimer to connect directly to the two-fold axis, thus any motions allowed by the symmetric architecture preserve the internal DOFs of both building blocks as shown in FIG. 4C.
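As a non-limiting illustration of the difference between the one-component and multi-component fold trees, the following Python sketch represents each tree as a mapping from a node to its parent and reports which master subunits move when a given rigid body DOF is changed. The tree wiring is a schematic reading of the description of FIGS. 4B and 4C and is an assumption; it is not Rosetta's internal data structure, and the names used are hypothetical.

```python
SUBUNITS = {"trimer", "dimer"}

# One-component tree (FIG. 4B style): the dimer hangs off the trimer's DOFs.
ONE_COMPONENT = {
    "J_T1": "root", "J_T2": "J_T1", "trimer": "J_T2",
    "J_T3": "J_T2", "dimer": "J_T3",
}

# Multi-component tree (FIG. 4C style): each building block has its own branch.
MULTI_COMPONENT = {
    "J_T1": "root", "J_T2": "J_T1", "trimer": "J_T2",
    "J_D1": "root", "J_D2": "J_D1", "dimer": "J_D2",
}

def subunits_moved_by(parent_of, jump):
    """Return the master subunits that are descendants of `jump`."""
    moved = set()
    for subunit in SUBUNITS:
        node = subunit
        while node in parent_of:       # walk up toward the root
            node = parent_of[node]
            if node == jump:
                moved.add(subunit)
                break
    return moved
```

In the one-component tree, changing J_T2 moves both the trimer and the dimer, whereas in the multi-component tree the same change moves only the trimer, consistent with the independence of the two sets of DOFs described above.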

In both the one-component and multi-component cases, the symmetry of a given target architecture (e.g., the T32 or T33 architecture) can be passed to Rosetta in the form of a symmetry definition file. The multi-component symmetry definition file syntax can be largely the same as the one-component syntax, with the additional requirement that the jumps connecting the protein subunits to the fold tree must specify which component is connected to each symmetry element.

Herein we define a symmetric architecture as a conceptual representation of a known mathematical symmetry group comprising at least one element of rotational symmetry, in which one or more of the elements of symmetry are explicitly considered; along each of the considered symmetry elements, multimeric protein building blocks with matching elements of symmetry can be aligned such that the symmetry elements of the building blocks and the designated symmetry group are collinear. Known mathematical symmetry groups with multiple different types of symmetry elements can be considered (for instance, octahedral point group symmetry contains two-fold, three-fold, and four-fold rotational symmetry elements); modeling nanostructures possessing these symmetries can require multiple distinct multimeric protein building blocks with distinct symmetries. In this way, a symmetric architecture defines: 1) the overall symmetry of the nanostructure being modeled, 2) the symmetries of the one or more distinct multimeric building blocks making up the symmetric nanostructure, and 3) the relative orientations of the symmetry axes of the one or more multimeric building blocks.

As a non-limiting example, a symmetric framework can be provided to model systems in two different symmetric architectures with tetrahedral point group symmetry. In one architecture, the assembly can be constructed from 4 trimers aligned along the tetrahedral three-fold symmetry axes and 6 dimers aligned along the two-folds; this architecture can be referred to as T32 (tetrahedron constructed from 3mers and 2mers). The second architecture, T33, can be constructed from four copies of one trimer and four copies of a second trimer, with the three-fold of each trimer aligned at opposite poles of each tetrahedral three-fold. Throughout the docking and design process the relative orientation of each of the two subunits within the trimers and/or dimers was maintained while allowing the trimeric or dimeric building blocks to translate along and rotate about the tetrahedral three-fold or two-fold symmetry axes.

The method disclosed herein can be used to model and design synthetic nanostructures possessing a wide variety of symmetries. In addition to the two-component tetrahedral symmetric architectures discussed above, nanostructures possessing octahedral or icosahedral point group symmetries can be modeled using the method, as well as nanostructures possessing dihedral point group symmetries, helical or line group symmetries, plane or layer group symmetries, or space group symmetries. In each symmetry, multimeric protein building blocks can be aligned along a subset of one or more of the elements of symmetry in the symmetry group in order to generate a synthetic nanostructure with the desired overall symmetry. The relative orientations of symmetry elements in all of the aforementioned symmetry groups are known, and the symmetry definition file disclosed herein provides one general and non-limiting mechanism for providing this information to the computational design method.

Two-Component Symmetric Docking

The methods and techniques described herein are discussed in the context of an example embodiment of the Rosetta™ software suite. However, the herein-described methods and techniques are not limited to use of RosettaDesign, the Rosetta™ software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to computationally dock multi-component symmetric protein nanostructures. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein. An application, tc_dock, was written within Rosetta™ to dock two distinct oligomeric building blocks into higher order symmetries in order to identify docked configurations predicted to be suitable for interface design. The required inputs for the tc_dock application are one PDB file containing a single subunit of the first scaffold component and a second PDB file containing a single subunit of the second scaffold component.

Sets of homodimeric and homotrimeric protein structures were curated to be input to our docking and design protocol. First, the PISA database was searched for all homodimeric or homotrimeric proteins passing the default criteria for dissociation energy, accessible surface area, buried surface area, percent buried surface area, and average chain length. The IDs obtained from PISA were then provided as input for the advanced search tool in the Protein Data Bank to select proteins clustered at 90% sequence identity with: 1) X-ray resolution less than 2 Å, 2) chain lengths of 75 to 200 amino acids, and 3) Escherichia coli as the host organism for protein expression. One trimeric protein that did not pass our automated selection criteria, PDB ID 3FTT, was added because of previous experience indicating it may serve as a successful design scaffold.

Coordinates for each of the selected PDB IDs were downloaded from the biological assemblies in the PDB and standardized for input to Rosetta. For biological assemblies containing multiple models with one chain per model, each model was treated as a separate chain. For assemblies containing multiple models with multiple chains per model, only the first model was considered. Alternative side chains and HETATM records were removed, selenomethionines were replaced with methionines, and the chain with the lowest average RMSD (as calculated by the super command in PyMOL) to all other chains was selected to be the input chain for design. Residues with missing main chain atoms were removed from the design input chain and its residues renumbered starting from 1. A new biological assembly was created in PyMOL by superimposing copies of the design input chain onto all other chains, and the assembly's symmetry axis was aligned along the vector [0, 0, 1] and its center of mass translated to the origin. Assemblies were discarded that were found to be too asymmetric, as assessed by the dispersion of symmetry axes implied by each tuple of symmetrically related atoms. For PDB IDs with multiple biological assemblies, the assembly with the lowest biological unit number found to match the expected C2 or C3 symmetry was chosen for design. The final set of 1,161 homodimeric proteins is listed in Table 1 below, and the final set of 200 homotrimeric proteins is listed in Table 2 below.

TABLE 1

1a1x_1 1a3c_1 1ag9_1 1alu_1 1alv_1 1az5_1 1b0x_1 1b8z_1 1bgf_1 1bm9_1 1btk_1 1buo_1 1byf_1 1byr_1 1c02_1 1cdc_1 1ci4_1 1ciz_2 1coz_1 1cxq_1 1d7j_3 1d9c_1 1dnl_1 1dz3_1 1e7l_1 1ecs_1 1eeq_1 1ek3_1 1ep0_1 1esr_1 1etx_1 1evx_1 1ex2_1 1ext_1 1eyv_1 1f1e_1 1f1g_1 1f1m_1 1f9f_1 1f9z_1 1fit_1 1fmb_1 1fux_1 1fzv_1 1g2i_1 1g2q_1 1g8q_1 1gd7_1 1gvj_1 1gvp_1 1gy6_1 1gy7_1 1gyx_1 1h8x_1 1he7_1 1hgx_1 1ht9_1 1hur_1 1i0r_1 1i12_1 1i3c_1 1i4y_1 1i9d_2 1ic2_2 1ihr_1 1ilk_1 1iq6_1 1is6_1 1ixl_2 1izm_1 1j24_1 1j27_2 1j2r_2 1j3m_1 1j3q_1 1j55_1 1j7g_1 1j8b_1 1j98_1 1jc4_1 1jhc_1 1jhg_1 1jk3_2 1jr8_1 1jrl_2 1jya_1 1k04_1 1k2e_1 1k3s_1 1k66_1 1k8u_1 1k9u_1 1kcq_2 1kl9_2 1kll_1 1kso_1 1l1q_1 1l3p_1 1l8s_1 1lgp_1 1lj9_1 1lq9_1 1ly1_1 1m0d_1 1m1f_1 1m2d_1 1m4i_1 1mi8_1 1mjh_1 1mk4_1 1mka_1 1mkk_1 1mp9_1 1msc_1 1mxi_1 1my6_1 1my7_1 1n99_1 1n9l_2 1ng2_2 1njh_1 1nki_1 1np6_1 1nqd_1 1nrz_5 1ns5_1 1nu3_1 1nxm_1 1nzn_2 1o22_1 1o3u_1 1o4t_1 1o4w_1 1o50_1 1o6a_1 1o6d_2 1oh0_1 1ohp_1 1oiv_1 1on2_1 1oqc_1 1oru_1 1ovs_2 1p6o_1 1pbj_2 1pdo_1 1psr_1 1puc_1 1pvm_1 1py9_1 1pzw_1 1q08_1 1q7s_3 1q8b_2 1q98_1 1q9u_1 1qip_1 1qou_1 1qto_1 1qwi_1 1r1t_1 1r1u_1 1r29_1 1r5q_2 1r7j_1 1r7l_1 1r9c_1 1rdo_1 1rfy_1 1rlk_2 1rxq_2 1s4k_1 1s67_1 1s7i_1 1s7z_2 1s99_1 1sd4_1 1sei_1 1sgm_1 1sh8_1 1sjw_2 1sjy_1 1sk4_2 1sl8_2 1snd_1 1t82_1 1t92_1 1tc5_1 1tfe_1 1tgj_1 1to4_1 1tu1_1 1tuh_1 1tuv_2 1tuw_1 1tvd_1 1twu_1 1u2w_1 1u3y_1 1u5f_2 1u69_2 1u7i_1 1uat_2 1udv_1 1ues_1 1ukk_1 1usc_1 1usm_1 1usp_1 1ut7_1 1uww_1 1uz3_1 1v05_1 1v2z_1 1v70_1 1v8y_1 1v96_1 1v9y_1 1vc1_1 1vh5_1 1via_1 1vj2_1 1vje_1 1vjl_1 1vkc_1 1vki_1 1vl7_1 1vq3_2 1vr7_1 1vzg_1 1w53_1 1wc9_1 1wkq_1 1wlt_1 1wn2_1 1woc_1 1wpn_1 1wu9_1 1wwc_2 1wwi_2 1wz3_1 1x0j_1 1x2i_1 1x6i_1 1x82_1 1x8d_1 1xe1_2 1xfs_1 1xhn_1 1xqa_1 1xrk_1 1xs0_1 1xso_1 1xsq_1 1xty_1 1xvq_2 1y0b_1 1y0h_1 1y0u_1 1y5h_1 1y7r_1 1y9q_1 1y9w_1 1yb3_2 1ybx_1 1ybz_2 1yfu_1 1ygt_1 1yhf_2 1yib_1 1ylk_1 1ylm_1 1yo3_1 1yoa_1 1yr0_1 1ysp_2 1z0p_1 1z2w_3 1z4e_1 1z9n_1 1z9p_1 1zb9_1 1zdn_3 1zhq_1 1zhv_2 1zj6_2 1zlj_1
1zn8_1 1zo2_1 1zop_1 1zps_1 1zpv_1 1zpw_2 1ztd_1 1zva_1 1zwy_1 1zxk_1 2a15_1 2a67_1 2a6c_1 2a72_1 2a8n_1 2a9s_1 2aan_2 2aao_3 2akp_3 2aps_1 2aqs_1 2asf_1 2auw_2 2b06_1 2b0a_1 2b0v_2 2b18_1 2b1y_1 2b3s_3 2b5a_1 2b5g_1 2b6h_2 2b8m_1 2b9a_1 2bbe_2 2bdr_1 2bnl_1 2bsj_1 2bz1_1 2c2i_1 2c9q_1 2car_1 2cvd_1 2cvi_1 2cwz_1 2cyy_1 2d37_1 2d4p_2 2d4u_1 2d5m_1 2d7v_1 2d8d_1 2dc3_1 2dc4_1 2dlb_1 2dm9_1 2dob_2 2dp9_1 2dpf_1 2dql_1 2duy_2 2dvk_1 2dxq_1 2e1f_2 2e1n_1 2e6u_2 2e8e_1 2eb1_1 2ebb_1 2ecu_1 2een_1 2ef8_1 2efv_1 2egd_1 2eh3_1 2ehp_1 2ei5_1 2eiq_3 2ejn_1 2eo4_1 2erb_3 2esu_2 2f22_1 2f4p_1 2f5g_1 2f62_1 2f99_2 2f9h_1 2fa1_1 2fa5_1 2fbh_1 2fbn_1 2fck_1 2fd5_1 2fe3_1 2fex_1 2fhq_1 2fip_1 2fiu_1 2fj9_2 2fjt_1 2fl4_1 2fpr_1 2fq4_1 2fr2_2 2fre_1 2ftr_1 2fu4_1 2fyq_1 2fyx_1 2g0c_2 2g0i_1 2g1u_1 2g3a_1 2g3r_2 2g7s_1 2g84_1 2gax_3 2gbt_1 2ge7_1 2gen_1 2gff_1 2glz_1 2goj_1 2gpc_1 2gu9_1 2gux_1 2gxg_1 2gyq_1 2gzv_2 2h1t_1 2h2b_1 2h8e_1 2h9u_1 2ha8_1 2hbo_1 2hcm_1 2hhg_2 2hhz_1 2hiq_1 2hkv_1 2hl0_1 2hlj_1 2hng_1 2hq9_1 2hql_1 2hs1_1 2hsb_1 2htd_1 2huh_1 2hur_2 2hyt_1 2hzc_2 2hzt_2 2i02_1 2i51_1 2i7a_1 2i7d_1 2i8b_1 2i8d_1 2i8t_1 2ia1_1 2ict_2 2idl_1 2iek_1 2if5_1 2ifx_1 2ig6_1 2igi_1 2ikk_1 2imj_1 2iml_1 2ims_1 2imz_1 2inb_1 2isy_1 2iu5_1 2ivy_1 2iwq_1 2ixk_1 2j6b_1 2j6y_1 2j7j_1 2j8m_1 2jar_1 2jba_1 2jdj_1 2je3_1 2jlj_1 2lig_1 2nlv_1 2nrk_1 2nsa_2 2nwv_1 2nx4_1 2nx8_1 2nyb_1 2nyc_1 2nyi_1 2nz7_1 2nzo_1 2o08_1 2o28_1 2o38_1 2o4t_1 2o6f_1 2o70_1 2o7m_1 2o95_1 2o99_1 2oa2_1 2oai_1 2ob5_1 2od4_1 2od6_1 2oda_1 2oee_1 2ogi_1 2oik_1 2okf_1 2oku_1 2olm_1 2omo_1 2onf_1 2oo2_1 2ooj_1 2ook_1 2opo_1 2oqk_2 2oqm_1 2oso_1 2ou3_1 2ou5_1 2ou6_1 2ouf_1 2ovs_1 2owp_1 2oyn_1 2oyz_1 2ozh_1 2ozj_1 2p08_1 2p09_1 2p12_1 2p25_1 2p3w_1 2p5q_1 2p7o_1 2p84_1 2p8g_1 2p8i_1 2p92_1 2pa7_1 2pey_1 2pfb_1 2pfi_1 2pfw_1 2pjs_1 2pk8_1 2pkh_1 2pmr_1 2pn0_1 2pn2_1 2pq3_2 2pqv_1 2prx_1 2pwo_1 2pyt_1 2q03_1 2q0y_1 2q20_1 2q24_1 2q2f_1 2q2h_1 2q2i_1 2q30_1 2q3p_1 2q3t_1 2q3x_1 2q4n_1 2q5c_3 2q79_1 2q82_1 2q8o_1 2q9k_1 2q9r_1 2qe9_1 2qhk_1 2qjw_1
2qkp_1 2ql8_1 2qml_2 2qmm_1 2qnd_3 2qnl_1 2qnt_1 2qqz_1 2qrr_1 2qsi_1 2qsw_1 2qtr_1 2qud_1 2qvm_2 2qx0_1 2r0x_1 2r1i_1 2r47_1 2r4i_1 2r6u_1 2r6v_1 2r78_1 2rbb_1 2rc3_1 2rcz_1 2rey_1 2rh0_1 2rhm_1 2ril_1 2riq_1 2rk3_1 2rk9_1 2rkf_1 2rkh_1 2uv4_1 2v57_1 2v90_1 2vez_1 2vkl_1 2voc_1 2vpk_1 2vs0_1 2vsv_1 2vvp_1 2vvw_1 2w1r_1 2w2a_1 2w31_1 2w4e_1 2w7w_1 2wb6_1 2wce_1 2wcr_1 2wcu_1 2wcw_1 2wfc_1 2wnx_1 2wp7_1 2wra_1 2wtg_1 2wzo_1 2x3g_1 2x5c_1 2x5h_1 2x5r_1 2x7z_1 2xbq_1 2xdp_1 2xf1_1 2xhf_1 2xr4_1 2xrh_1 2xxc_1 2y0o_1 2y39_1 2y6w_1 2y78_1 2yfd_1 2ykz_1 2yqy_3 2ysk_1 2yvo_1 2ywl_1 2yxh_1 2yz1_3 2yzk_1 2z10_1 2z6d_1 2z8u_1 2z98_1 2zcm_1 2zdo_1 2zdp_1 2zej_1 2zgl_1 2znd_1 2zpm_2 2zvy_1 2zw2_1 2zxy_1 3a2y_2 3a5p_1 3a6r_1 3a6s_3 3acd_1 3agx_1 3ah7_1 3aly_3 3b02_1 3b09_1 3b33_1 3b47_1 3b5g_1 3b5t_1 3b76_1 3b7c_1 3b7h_1 3b9c_1 3bb9_1 3bcw_1 3bde_1 3bln_1 3bm1_1 3bm7_1 3bmz_1 3bn7_1 3bpj_1 3bpv_1 3bqx_1 3bri_1 3bs3_1 3but_1 3by8_2 3byr_1 3bzh_1 3bzt_2 3c0f_1 3c1d_3 3c1q_1 3c3m_1 3c97_1 3can_1 3cb0_1 3cby_1 3ce1_1 3cex_1 3cjd_1 3cje_1 3cjn_1 3cm3_1 3cng_1 3cnk_1 3cnu_1 3cp3_1 3ct6_1 3cu3_1 3czt_1 3d00_1 3d0f_1 3d0j_1 3d0w_1 3d5p_1 3d7a_1 3db7_1 3dcm_1 3df8_1 3dib_2 3dlo_1 3dm8_1 3dmc_1 3dn7_1 3dnx_1 3do8_1 3dpj_1 3dr6_1 3dsb_1 3dz8_1 3e10_1 3e17_1 3e2c_1 3e39_1 3e4v_1 3e5h_1 3e8o_1 3ebt_1 3ec6_1 3ec9_1 3ecf_1 3f3x_1 3f43_1 3f7e_1 3f7l_1 3f8h_1 3f8x_1 3f9s_1 3fcd_1 3fcn_1 3fd7_3 3ff0_1 3ffy_1 3fg9_1 3fgv_1 3fgy_1 3fh1_1 3fjs_1 3fkc_2 3flj_1 3fm2_1 3fm5_1 3fmb_1 3fn7_1 3fov_1 3fqm_1 3frq_1 3fu1_1 3fv6_1 3fwz_1 3fx7_1 3fxh_1 3fyb_1 3fyn_1 3g0k_1 3g13_1 3g14_1 3g16_1 3g26_1 3g2b_1 3g46_1 3g7p_1 3g8g_1 3g8k_1 3g8z_1 3gby_1 3gdw_1 3gfa_1 3ggq_1 3ggu_1 3ghj_1 3gla_1 3glv_1 3gm5_1 3gpv_1 3grd_1 3guz_1 3gwk_1 3gwn_1 3gxh_3 3gya_2 3gyd_1 3gzr_1 3h05_1 3h0n_1 3h0x_2 3h1s_1 3h2d_1 3h36_1 3h3h_1 3h4o_1 3h4y_1 3h51_1 3h6q_2 3h8h_2 3h8u_1 3h95_1 3ha2_1 3ha9_2 3hcz_2 3hdc_1 3hf5_1 3hhv_1 3hiu_1 3hix_5 3hk4_2 3hm4_1 3hmf_2 3hmz_1 3hoi_1 3hqx_1 3hr7_1 3ht1_1 3huh_1 3hup_1 3hvv_1 3hx9_1 3hyq_2 3hzb_1 3hzp_1 3i24_1 3i3g_1
3ia1_1 3ia8_3 3ibm_1 3ifj_3 3ift_2 3igr_2 3iis_1 3ijm_1 3ilx_1 3in8_1 3inq_1 3ip0_2 3ir3_1 3itf_1 3ix3_1 3jrz_1 3jtf_1 3jtw_1 3jtz_1 3jum_1 3jx9_1 3k0z_1 3k1e_1 3k21_1 3k2v_1 3k3v_2 3k67_1 3k69_1 3k86_1 3kb5_2 3kbe_1 3kbq_1 3kby_1 3kg0_2 3kgz_1 3kk4_1 3kkg_1 3kl1_1 3kol_1 3kor_1 3kpc_1 3ksh_1 3ksv_1 3kuv_1 3kwk_1 3kyz_1 3l18_1 3l1e_1 3l1n_2 3l34_1 3l3u_1 3l46_1 3l7h_1 3l7x_1 3l8u_1 3l9y_1 3lag_1 3las_1 3lb5_1 3lby_1 3le4_1 3le5_1 3leq_2 3lf6_1 3lfh_1 3lfp_1 3lfr_1 3lhc_1 3lhr_1 3lin_1 3lio_1 3llv_2 3lmo_1 3lqn_1 3lqy_2 3lr0_1 3lte_1 3lw3_1 3lwc_1 3lx7_1 3lyd_1 3lyg_1 3lyh_1 3lyx_1 3lza_1 3lzl_1 3m1e_1 3m5b_1 3m6j_1 3m8e_1 3m9z_1 3mcw_1 3mdp_1 3mgd_1 3mgm_1 3mhx_1 3mmh_1 3mng_1 3msh_1 3mti_1 3mtq_1 3mws_1 3myf_1 3n1s_1 3n4j_1 3n4w_1 3n6y_3 3n8b_1 3nad_1 3nbc_1 3neu_1 3nfc_1 3nj2_1 3njc_1 3nl9_1 3nqn_1 3nr1_1 3nrh_3 3nrp_1 3ny5_1 3nym_1 3o0m_1 3o10_1 3o1c_1 3o2e_2 3o2r_1 3o79_3 3oa4_1 3obh_1 3oga_1 3ogh_1 3ohe_1 3oj7_1 3oji_1 3okx_1 3oms_1 3on4_1 3oni_2 3oop_1 3ose_2 3ov8_1 3oxp_1 3p0t_1 3p2t_2 3pc6_1 3pg6_1 3pmd_1 3pn3_3 3pp9_1 3pr6_2 3pu7_1 3q20_1 3q34_1 3q3y_3 3q62_1 3q63_1 3q64_1 3q6a_1 3q7r_1 3q8i_2 3q90_1 3qbm_1 3qdo_1 3qfl_1 3qh6_2 3qmq_1 3qoo_1 3qp4_1 3qp8_1 3qs2_1 3qul_1 3qzx_1 3r0n_1 3r5g_1 3r68_2 3r6a_1 3r6f_1 3rcp_2 3rd1_1 3rem_1 3rfi_2 3rkc_1 3rmh_1 3rmu_1 3rob_1 3rqi_1 3rt2_2 3s2r_1 3s45_1 3s6f_1 3s8i_1 3s9f_1 3sb1_1 3sd2_1 3sk2_1 3sl2_2 3sl7_1 3slz_1 3smd_2 3smj_3 3son_1 3soy_1 3svi_2 3sxm_1 3sz7_2 3szj_1 3t1s_1 3t43_1 3t46_2 3t8r_1 3t90_1 3t9y_1 3td4_4 3teq_1 3tgn_1 3tgv_1 3tj8_1 3tk0_1 3tnj_1 3tol_1 3trc_1 3typ_1 3tys_1 3u04_1 3u15_1 3u1d_1 3u2a_1 3u5v_1 3u6g_1 3u80_1 3ub6_1 3ucb_1 3ucg_1 3ufe_1 3uh9_1 3uie_1 3ulb_1 3ups_1 3urr_1 3uv0_1 3ux2_1 3vjz_1 3vk6_1 3vp5_1 3vql_1 3vub_1 3zrd_1 3zve_1 3zw5_1 3zxc_1 3zxq_1 3zy7_1 4a1i_1 4a5k_1 4a5n_1 4ae4_1 4aeq_1 4ag7_1 4agh_1 4alg_1 4avp_1 4ax2_1 4b4p_1 4b6i_1 4di0_1 4duq_1 4e08_1 4e0h_1 4e2g_1 4e74_1 4e7p_1 4eae_1 4egu_1 4em8_1 4err_1 4es1_1 4eun_1 4ew5_1 4ew7_1 4exo_1 4exr_1 4ezg_1 4f82_1 4f8y_1 4fak_1 4fiv_1 4flb_1 4fld_1 4g5a_1
4g6x_1 4gdh_1 4ghj_1 4giw_1 4go7_1 4gs3_2 4gwb_1

TABLE 2

1buu_1 1dbf_1 1dg6_1 1di6_1 1f7l_1 1fth_1 1gr3_1 1gu9_1 1gx1_1 1h7z_1 1h9m_1 1hfo_1 1idp_1 1iv2_1 1jd1_1 1jlj_1 1jq0_1 1jw8_2 1knb_1 1kr4_1 1lr0_2 1n2m_1 1nog_1 1nza_1 1o5j_1 1o91_1 1ocy_1 1oni_1 1ox3_1 1p1l_2 1pwb_1 1q5h_1 1q5x_1 1qu1_1 1rlh_2 1s55_1 1seh_1 1sjn_1 1t0a_1 1tcz_1 1td4_1 1u5x_1 1u9d_2 1ufy_1 1uku_1 1usn_2 1uuy_1 1uxa_1 1v3w_1 1ve0_1 1vfj_1 1vhf_2 1vmf_1 1vmh_1 1vph_1 1woz_1 1wy1_1 1x25_1 1xhd_2 1yq5_1 2aal_1 2bcm_1 2brj_1 2bt9_1 2bzv_1 2chc_1 2cu5_1 2cvl_1 2dt4_1 2e7a_1 2ed6_1 2eg2_1 2f0c_1 2fb6_2 2fvh_1 2g2d_1 2gdg_1 2gr8_1 2gw8_1 2h6l_1 2hx0_1 2ibl_1 2ieq_1 2ig8_1 2is8_1 2j2j_1 2j9c_1 2jb7_1 2nuh_2 2oj6_1 2ol1_1 2otm_1 2p2o_1 2p6c_1 2p6h_1 2p6y_1 2p9o_1 2pii_1 2qg8_1 2qih_1 2r32_1 2r6q_1 2rfr_1 2rie_1 2tnf_1 2uyk_1 2vnl_1 2w5p_1 2wds_1 2wh7_1 2wkb_1 2wpq_1 2wq4_1 2x4j_1 2xcz_1 2xdh_1 2xdj_1 2xx6_1 2y75_2 2yzj_1 2zhz_1 3aqe_7 3b64_1 3b8l_1 3bsw_1 3bzq_1 3c6v_1 3cc0_1 3ci3_1 3cp1_1 3d01_1 3d9x_1 3da0_1 3djh_1 3e6q_1 3eby_1 3efg_1 3ehw_1 3ejc_1 3ejv_1 3emf_1 3f09_1 3f0d_1 3f4f_1 3fq3_3 3ftt_1 3fuy_1 3fwt_1 3fwu_1 3gqh_1 3h5i_1 3h6x_1 3htn_1 3hwu_1 3hza_1 3i3f_1 3i7t_1 3ixc_1 3jv1_1 3k6a_1 3kan_1 3kjj_1 3laa_1 3lqw_1 3m1x_1 3mc3_1 3mci_1 3mdx_1 3mf7_1 3mhy_1 3mko_1 3mqh_1 3mxu_2 3n79_1 3ne3_2 3nfd_1 3o46_1 3oiu_2 3opk_1 3p48_1 3pzy_1 3qc7_1 3qr7_1 3quw_1 3r1w_1 3r3r_2 3rwn_1 3so2_1 3ta2_1 3tio_1 3tq5_1 3tqz_1 3uv9_2 3v4d_1 3vi6_2 3zw0_1 4aff_1 4g2k_1 4gb5_1 4gdz_1

The subunits can be arranged at the origin according to the symmetry specified by command-line options or through a user-provided symmetry definition file. Then the full space of contacting symmetric configurations can be sampled by systematically varying the translational and rotational degrees of freedom (DOFs) in the system. In order to test all four possible orientations of the two building blocks (inside/inside, inside/outside, outside/inside, outside/outside) two separate docking runs can be performed in which the orientation of one of the building blocks is reversed by setting the Rosetta command-line option tcdock::reverse to true. Configurations in which backbone or beta carbon atoms from different building blocks clash (distance between backbone amide nitrogen and carbonyl oxygen atoms <=2.6 Å; distance between all other backbone/beta carbon atom pairs <=3.0 Å) can be discarded.
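
The backbone clash test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tc_dock implementation: the atom representation (a name and a coordinate tuple) and the function names are assumptions made for the example, while the 2.6 Å and 3.0 Å cutoffs are the ones quoted in the text.

```python
import math

# Cutoffs quoted in the text; the atom representation below is an assumption.
N_O_CUTOFF = 2.6    # backbone amide N vs. carbonyl O, in angstroms
OTHER_CUTOFF = 3.0  # every other backbone / beta carbon pair, in angstroms


def is_clash(atom_a, atom_b):
    """Return True if two atoms from different building blocks clash.

    Each atom is a (name, (x, y, z)) tuple, e.g. ("N", (0.0, 0.0, 0.0)).
    """
    name_a, xyz_a = atom_a
    name_b, xyz_b = atom_b
    # An N/O pair may approach more closely than any other pair.
    cutoff = N_O_CUTOFF if {name_a, name_b} == {"N", "O"} else OTHER_CUTOFF
    return math.dist(xyz_a, xyz_b) <= cutoff


def count_clashes(block_a, block_b):
    """Count clashing atom pairs between two building blocks."""
    return sum(is_clash(a, b) for a in block_a for b in block_b)
```

A docked configuration would then be discarded when `count_clashes` exceeds whatever tolerance the user chooses.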

In each non-clashing configuration, a designability score can be calculated. For example, the designability score can be calculated as the sum of the number of beta carbon contacts between building blocks (where a contact is defined as two beta carbon atoms within 12 Å), weighted by the type of secondary structures on which the contacting positions exist (by setting the Rosetta tcdock::cb_weight_secstruct command line option to true) and the average degree of connectivity (the number of amino acid positions within a user-specified distance threshold within the multimeric building block) of the contacting positions (by setting the Rosetta tcdock::cb_weight_average_degree command line option to true). This designability measure favors the selection of docked configurations with large numbers of contacting residues on well-anchored regions of protein structure. In addition to inter-component contacts, which can be contacts between building blocks of the two different components, two-component systems can also possess intra-component contacts or contacts between building blocks of the same component. The Rosetta command-line options tcdock::intra, tcdock::intra1, and tcdock::intra2 control the contribution to the designability score of intra-component contacts for both components, for component 1, and for component 2, respectively.
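
As a rough illustration of the designability score, the sketch below sums beta carbon contacts within 12 Å and weights each contact by secondary structure. The weight values, the residue representation, and the rule for combining the weights of two contacting positions are hypothetical, since the text does not give them, and the average-degree weighting is omitted for brevity.

```python
import math

CONTACT_CUTOFF = 12.0  # angstroms, from the text

# Hypothetical per-secondary-structure weights (H = helix, E = strand,
# L = loop); the values actually used by tc_dock are not given in the text.
SS_WEIGHT = {"H": 1.0, "E": 1.0, "L": 0.5}


def designability(block_a, block_b, ss_weight=SS_WEIGHT):
    """Sum weighted beta carbon contacts between two building blocks.

    Each residue is a dict with "cb" (x, y, z) and "ss" (one-letter code).
    """
    score = 0.0
    for res_a in block_a:
        for res_b in block_b:
            if math.dist(res_a["cb"], res_b["cb"]) <= CONTACT_CUTOFF:
                # Assumption: weight a contact by the mean of its two positions.
                score += 0.5 * (ss_weight[res_a["ss"]] + ss_weight[res_b["ss"]])
    return score
```

With all weights set to 1.0 the function reduces to a plain count of inter-building-block beta carbon contacts.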

Data and PDB files can be output for a user-defined number of top scoring configurations (set by the Rosetta tcdock::topx command-line option). The data, which can be saved by redirecting the output of the run to a log file, includes the rigid body DOFs, the designability score, the number of carbon beta contacts between building blocks, the number of contacting residues between building blocks, the average score per carbon beta contact, and the average score per contacting residue.

In one example, the 1,161 dimers and 200 trimers in the scaffold sets listed in Tables 1 and 2 provided 232,200 unique pairwise combinations of trimers with dimers, and 19,900 unique pairwise combinations of trimers. Docking was carried out for each of these unique combinations both with and without the tcdock::reverse option set to true, for a total of 504,200 independent docking trajectories. The tcdock::intra option was set to false such that intra-component contacts were not included in the calculated scores.
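
The counts in this example follow directly from the sizes of the two scaffold sets, as the short calculation below shows (the variable names are chosen for illustration only).

```python
from math import comb

n_dimers, n_trimers = 1161, 200

t32_pairs = n_trimers * n_dimers   # one trimer paired with one dimer
t33_pairs = comb(n_trimers, 2)     # two distinct trimers, order ignored
# Each pair is docked twice: with and without tcdock::reverse.
trajectories = 2 * (t32_pairs + t33_pairs)

print(t32_pairs, t33_pairs, trajectories)  # 232200 19900 504200
```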

For each unique scaffold combination, the 3 top scoring T33 docks were selected. This set of 59,700 distinct configurations was ranked by the average designability score per residue and the top 1,000 were used as input for interface design. For T32, data were output for the 40 top scoring docked configurations per docking trajectory. This set of 18,576,000 distinct configurations was filtered to remove all configurations with fewer than 80 contacting residues between building blocks and ranked by the average designability score per residue. The ranked set was then filtered to retain only the top-ranked configuration for each unique scaffold pair, and the top 1,000 configurations were used as input for interface design.
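
The T32 selection steps (filter by contacting residues, rank by average score per residue, keep the best configuration per scaffold pair, keep the top 1,000) can be sketched as follows. The dictionary keys are hypothetical names for values taken from the docking log, and the sketch assumes that a higher designability score is better.

```python
def select_t32_docks(docks, min_contacts=80, top_n=1000):
    """Filter and rank docked configurations as described for T32.

    Each dock is a dict with the hypothetical keys "pair" (scaffold pair
    identifier), "contacting_residues" and "score_per_residue".
    """
    # Remove configurations with fewer than min_contacts contacting residues.
    passing = [d for d in docks if d["contacting_residues"] >= min_contacts]
    # Rank by average designability score per residue, best first.
    passing.sort(key=lambda d: d["score_per_residue"], reverse=True)
    best_per_pair = {}
    for dock in passing:
        # The list is sorted, so the first dock seen for a pair is its best.
        best_per_pair.setdefault(dock["pair"], dock)
    return list(best_per_pair.values())[:top_n]
```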

Two-Component Symmetric Interface Design

The herein-described methods and techniques are not limited to use of RosettaDesign, the Rosetta software suite, or any other specific software package. For example, other software programs could be used in conjunction with this method to design new amino acid sequences at protein-protein interfaces. As will be understood by those of skill in the art, the implementation of the design methods of the invention described below is non-limiting, and the methods are in no way limited to the implementation disclosed herein.

A set of protein-protein interface design protocols was developed within Rosetta to identify mutations predicted to drive assembly of two distinct protein building blocks into higher order symmetric complexes. The design functionality was broken into modular components and implemented within the RosettaScripts™ framework in order to facilitate future code development and to provide users the ability to modify each step of the design process without having to change the underlying C++ code.

The design process can have four stages: I) interface design, II) shape complementarity optimization, III) automated reversion, and IV) resfile-based refinement. The protocols used in each stage can take as input a symmetry definition file and a PDB file containing a single subunit of both scaffold proteins; the latter can be produced by concatenating the two scaffold protein PDB files used as input for docking and changing the chain of the second subunit to be “B”. In addition, initial values for the translational and rotational symmetric rigid body DOFs can be specified through user-defined variables. All design calculations can be performed on the two independent subunits and propagated symmetrically.
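
One way to prepare the combined input file is sketched below. It relies on the fixed-column PDB format, in which the chain identifier of ATOM and HETATM records occupies column 22; the function name is hypothetical, and records other than ATOM and HETATM are simply dropped in this sketch.

```python
def merge_subunits(pdb_a_text, pdb_b_text, chain="B"):
    """Concatenate two single-subunit PDB files, relabeling the second chain.

    In the PDB format the chain identifier of ATOM/HETATM records sits in
    column 22 (index 21 when counting from zero).
    """
    records = ("ATOM", "HETATM")
    merged = [line for line in pdb_a_text.splitlines()
              if line.startswith(records)]
    for line in pdb_b_text.splitlines():
        if line.startswith(records):
            # Overwrite only the chain identifier column.
            merged.append(line[:21] + chain + line[22:])
    return "\n".join(merged) + "\n"
```

The two arguments are the contents of the scaffold files used for docking, read as text.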

Stage I:

Interface design can involve carrying out multiple design trajectories for each docked configuration. At the start of each trajectory, the symmetric rigid body DOFs can be perturbed in order to sample nearby docked configurations. The behavior of these perturbations can be set by the user, including specifying whether to sample values from a user-defined grid of angles and displacements or randomly from user-defined uniform or Gaussian distributions of angles and displacements. Trajectories yielding docked configurations with clashing backbones (distance between backbone amide nitrogen and carbonyl oxygen atoms <=2.6 Å; distance between all other backbone/beta carbon atom pairs <=3.0 Å) can be discarded prior to interface design based on user-defined cutoff values for the number of clashing atoms.

In each of the remaining trajectories, interface residues can be selected according to some or all of the following three criteria: 1) the residue has a beta carbon (alpha carbon in the case of glycine) within a user-defined cutoff distance to a beta carbon (alpha carbon in the case of glycine) in a different building block (in this study the default 10 Å cutoff was used), 2) the residue has a nonzero solvent accessible surface area when the protein subunits are in the unbound state, and 3) with the exception of residues that have high Lennard-Jones repulsive scores (fa_rep), the residue does not make contacts (any heavy atoms within 5 Å) with other subunits in the same oligomeric building block. Residues matching all three criteria can be considered designable, with the exception of proline and glycine, which are restricted to repacking. In some scenarios, criterion 3 is not enforced.

Residues fulfilling criteria 1 and 2 can be termed “interface positions”, and residues fulfilling criteria 1, 2, and 3 can be termed “design positions”. Thus, all design positions are also interface positions, but not all interface positions are design positions. These positions can be updated at multiple points throughout design stages I through IV by appending any positions that newly satisfy the selection criteria to the previously defined sets. All residues not in the selected sets remain fixed throughout the design process. In addition, mutations to proline, glycine, or cysteine are prohibited unless explicitly specified otherwise by the user via a resfile (see stage IV). Optionally, a reduced amino acid set can be used during Stage I such that only the native amino acid and mutations to a subset of the 20 common amino acids are allowed at each design position.
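
The three selection criteria can be expressed as a small classifier over precomputed per-residue features. This is a minimal sketch: the `Residue` fields, the label strings, and the idea of precomputing the features are assumptions made for illustration, while the 10 Å default cutoff and the treatment of proline and glycine follow the text.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    """Precomputed per-residue features; all field names are hypothetical."""
    name: str                  # three-letter residue name
    min_cb_distance: float     # to nearest CB (CA for Gly) in another block
    unbound_sasa: float        # solvent accessible surface area, unbound
    intra_block_contact: bool  # heavy atom within 5 A of another subunit
    high_fa_rep: bool = False  # high Lennard-Jones repulsive score


def classify(res, cutoff=10.0, enforce_criterion_3=True):
    """Return "design", "repack", "interface" or "fixed" for one residue."""
    if not (res.min_cb_distance <= cutoff and res.unbound_sasa > 0.0):
        return "fixed"      # fails criterion 1 or 2
    passes_3 = res.high_fa_rep or not res.intra_block_contact
    if enforce_criterion_3 and not passes_3:
        return "interface"  # an interface position, not a design position
    # Proline and glycine at design positions are restricted to repacking.
    return "repack" if res.name in ("PRO", "GLY") else "design"
```

Every residue labeled "design" or "repack" is also an interface position, mirroring the relationship between the two sets.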

Once the design positions have been selected, an initial round of design can be carried out using the standard RosettaDesign algorithm and a version of the Rosetta™ scorefunction, soft_rep, in which the Lennard-Jones repulsive term (fa_rep) is down-weighted to favor tightly packed interfaces. The scorefunction can then be set to score12 and the Rosetta energy minimized through a series of small changes to the design position side chain configurations and the symmetric rigid body DOFs (i.e., the side chains and rigid body DOFs are symmetrically minimized). Designs with contacting interface areas not meeting user-defined thresholds can be discarded. For those designs passing the interface area cutoffs, the design positions can be updated and a second round of interface design carried out using the standard RosettaDesign™ algorithm with the score12 scorefunction. The design position side chains can be repacked and the interface position side chains and rigid body DOFs can be subjected to at least one round of minimization.

Several metrics can be used to gauge the quality of the interfaces resulting from this first stage of design and to select designs to carry forward to shape complementarity optimization in Stage II. These metrics include, but are not limited to: 1) the number of buried unsatisfied hydrogen bonds at the designed interface, 2) the shape complementarity of the designed interface, and 3) the predicted binding energy of the interface, defined as the difference in energy between the bound and unbound (individual building blocks) state following repacking of the side chains at the design positions and minimization of the side chains at the interface positions in the unbound state. For each passing design, the values of the final rigid body DOFs can be output to a scorefile along with the metric values and the standard score12 score terms, and a resfile can be generated containing each of the design positions and their amino acid identities.

In one example, 100 independent design trajectories were run for each of the top 1000 docked T32 and T33 configurations (vide supra). At the start of each trajectory, the building blocks were displaced 2 Å away from the assembly's center of mass along their symmetry axes, and the translational rigid body DOFs were perturbed by sampling randomly from a Gaussian distribution with a standard deviation of 0.75 Å and the rotational rigid body DOFs were perturbed by sampling randomly from a Gaussian distribution with a standard deviation of 2 degrees. Trajectories yielding more than 8 clashing backbone atoms were removed from further design considerations. A reduced amino acid set was employed during this stage of the design process such that only mutations to the following 8 amino acids were allowed: alanine, aspartate, isoleucine, leucine, asparagine, serine, threonine, and valine. Additionally, during all RosettaDesign steps in all stages, the chi2 angle for aromatic side chains being repacked or designed was restricted to between 70 and 110 degrees.
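
The starting perturbation used in this example can be sketched as below. The function name and the sign convention (translation measured along the symmetry axis, positive away from the assembly's center of mass) are assumptions for illustration; the 2 Å displacement and the standard deviations of 0.75 Å and 2 degrees are the values quoted in the text.

```python
import random


def perturb_dofs(translation, rotation, rng, push_out=2.0,
                 trans_sd=0.75, rot_sd=2.0):
    """Perturb one building block's rigid body DOFs for a new trajectory.

    translation is in angstroms along the symmetry axis, measured away from
    the assembly's center of mass (assumed convention); rotation is in
    degrees about that axis.
    """
    new_translation = translation + push_out + rng.gauss(0.0, trans_sd)
    new_rotation = rotation + rng.gauss(0.0, rot_sd)
    return new_translation, new_rotation


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# 100 perturbed starting points for one docked configuration.
starts = [perturb_dofs(25.0, 40.0, rng) for _ in range(100)]
```

Each building block in the asymmetric unit would receive its own independent perturbation.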

T32 design trajectories yielding contacting interface areas of less than 1,100 Å² or greater than 2,000 Å² following the first round of design were discarded. The passing T32 designs were further filtered at the end of Stage I to remove those that had more than 45 mutations or 8 buried unsatisfied hydrogen bonds at the designed interface, a predicted binding energy greater than −12 REU, or a shape complementarity score of less than 0.60. The T33 design trajectories were filtered based on contacting interface areas at the end of Stage I rather than after the first round of design, discarding those that yielded contacting interface areas of less than 600 Å². The passing T33 designs were further filtered to remove those with more than 100 mutations or 10 buried unsatisfied hydrogen bonds at the designed interface, a predicted binding energy greater than −12 REU, or a shape complementarity of less than 0.55. The resulting 1,292 T32 designs and 593 T33 designs were subjected to the protocol described in Stage II below.
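
The Stage I thresholds for the two architectures can be collected in one table and applied with a single predicate, as in the sketch below. The dictionary keys and function name are hypothetical, and the sketch collapses all filters into one step even though, for T32, the interface area filter was applied after the first round of design rather than at the end of Stage I.

```python
# Thresholds quoted in the text; the dict keys are hypothetical names.
STAGE_I_LIMITS = {
    "T32": {"min_area": 1100.0, "max_area": 2000.0, "max_mutations": 45,
            "max_unsat_hbonds": 8, "max_binding_energy": -12.0,
            "min_sc": 0.60},
    "T33": {"min_area": 600.0, "max_area": float("inf"),
            "max_mutations": 100, "max_unsat_hbonds": 10,
            "max_binding_energy": -12.0, "min_sc": 0.55},
}


def passes_stage_i(design, architecture):
    """Return True if a design survives the Stage I filters."""
    lim = STAGE_I_LIMITS[architecture]
    return (lim["min_area"] <= design["interface_area"] <= lim["max_area"]
            and design["mutations"] <= lim["max_mutations"]
            and design["unsat_hbonds"] <= lim["max_unsat_hbonds"]
            # Binding energy in REU; more negative is more favorable.
            and design["binding_energy"] <= lim["max_binding_energy"]
            and design["shape_complementarity"] >= lim["min_sc"])
```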

Stage II:

Stage II begins by regenerating the initial design from the two input scaffolds: 1) the rigid body DOFs output from Stage I are used to reposition the subunits in the fully assembled state, 2) the interface positions are re-selected using the same criteria as before, with the exception that all positions specified in the input resfile are included regardless of whether or not they fulfill the criteria in the input state, 3) the resfile output from stage I is used as input to the RosettaDesign algorithm to reintroduce the initial design mutations, and 4) the interface position side chains are subjected to one or more rounds of minimization and/or repacking.

Then, optimization techniques, such as greedy optimization, can test individual reversions to native amino acids at all mutated residues. A custom reversion score can be used in which individual mutations are filtered to remove those that increase the number of buried unsatisfied hydrogen bonds at the designed interface and scored according to the sum of the predicted binding energy, the total Rosetta energy, and a residue type constraint energy favoring the native amino acid. The potential reversions can be combined one at a time proceeding from the individually best scoring to worst scoring reversions at each position, only accepting those that do not increase the number of buried unsatisfied hydrogen bonds at the designed interface and improve the reversion score in the context of all previously accepted mutations. In some embodiments, the buried unsatisfied hydrogen bond criterion is optional; for example, this criterion was used for the T32 designs, but not T33.

Following another one or more rounds of interface position side chain minimization and/or repacking, optimization techniques are used to increase the shape complementarity of the designed interfaces. Mutations to all amino acids except cysteine, glycine, and proline can be tested individually at each design position as defined by the input resfile. Each mutation can be ranked by the shape complementarity of the design interface if the mutation does not: 1) increase the total Rosetta energy by more than 2.0 Rosetta energy units (REU), 2) decrease the predicted binding energy by 1.0 REU, 3) introduce any new unsatisfied hydrogen bonds, or 4) increase the fa_dun component of the score, which can be an internal energy of side chain rotamers as defined by statistics from the Dunbrack library, by more than 2.5 REU (the fa_dun criterion is optional; it was used for the T32 designs, but not T33). Next, mutations can be combined one at a time proceeding from the best scoring to worst scoring individual mutations, only accepting those that still pass the same three or four criteria and improve the shape complementarity in the context of all previously accepted mutations. During both the reversion and shape complementarity optimization, all of the interface positions can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation.

In addition to the standard Rosetta scores, the following metrics, and perhaps others, can be used to assess the quality of each design following one or more rounds of interface position side chain minimization and/or repacking: 1) the total number of mutations, 2) the number of buried unsatisfied hydrogen bonds at the interface, 3) the average degree of each design position, 4) the RosettaHoles packing score, 5) the average total Rosetta energy, fa_atr, fa_rep, and fa_dun for each filter position, 6) the contacting interface area, 7) the predicted binding energy, 8) the shape complementarity, and 9) the change in predicted binding energy resulting from individual mutations of each interface side chain to alanine (i.e., a computational alanine scan of the designed interface). Those designs passing a set of user-defined thresholds for each metric are subsequently subjected to visual inspection to further filter the designs. A scorefile with the metric values and the standard score12 score terms, and a resfile containing the design positions and their amino acid identities, are generated for each design at the end of Stage II.

In one example, the T32 designs resulting from Stage II were filtered to remove those with a shape complementarity score less than 0.65, predicted binding energies of greater than −25 REU, a positive RosettaHoles score for the designed interface, an interface area less than 1,200 Å², or more than 1 buried unsatisfied hydrogen bond at the designed interface. The 283 passing T32 designs were visually inspected and manually curated down to a list of 68 designs that were subjected to the reversion protocol outlined in Stage III. The T33 designs resulting from Stage II were filtered, visually inspected and manually curated down to a list of 38 designs that were subjected to the reversion protocol outlined in Stage III.
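By way of illustration only, the T32 threshold filter of this example can be expressed as a simple predicate. The `DesignMetrics` record, its field names, and the sample values below are hypothetical; only the threshold values are taken from the example above, and the metrics themselves are assumed to have been read from the Stage II scorefile.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DesignMetrics:
    """Hypothetical per-design metrics, as might be read from a scorefile."""
    name: str
    shape_complementarity: float
    binding_energy: float      # predicted binding energy (REU)
    holes_score: float         # packing score for the designed interface
    interface_area: float      # square Angstroms
    buried_unsat_hbonds: int


def passes_t32_filters(d: DesignMetrics) -> bool:
    """Keep a design only if none of the removal conditions applies."""
    return (d.shape_complementarity >= 0.65
            and d.binding_energy <= -25.0
            and d.holes_score <= 0.0
            and d.interface_area >= 1200.0
            and d.buried_unsat_hbonds <= 1)


# Made-up designs for illustration only.
designs: List[DesignMetrics] = [
    DesignMetrics("design_001", 0.70, -31.2, -0.4, 1450.0, 0),
    DesignMetrics("design_002", 0.62, -33.0, -0.1, 1600.0, 0),  # low shape complementarity
    DesignMetrics("design_003", 0.68, -22.5, -0.2, 1300.0, 1),  # weak predicted binding
    DesignMetrics("design_004", 0.66, -27.0, 0.3, 1250.0, 1),   # positive holes score
]
passing = [d.name for d in designs if passes_t32_filters(d)]
```

Designs passing such a filter would still be subject to the visual inspection and manual curation described above.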

Stage III:

The third stage in the design process can identify, via an automated computational process, mutated residues predicted not to be critical for assembly and revert them to their native amino acid identities. This helps to minimize the number of mutations being made to the scaffold proteins and reduces the amount of refinement required in Stage IV.

Stage III can begin by regenerating the design from the two input scaffolds using the rigid body DOFs from Stage I and the resfile output from Stage II: 1) the rigid body DOFs can be used to reposition the subunits in the fully assembled state, 2) the interface positions can be re-selected in the same manner as in Stage II, 3) the resfile can be used as input to the RosettaDesign algorithm to reintroduce the initial design mutations, and 4) at least one round of interface position side chain and rigid body DOF minimization, side chain repacking, and minimization is performed.

Next, greedy optimization or another optimization algorithm can be used to revert mutations to the native amino acid identities as follows. During the first part of the optimization algorithm, each reversion can be tested individually and ranked by the change in shape complementarity if the reversion does not: 1) decrease the predicted binding energy by more than 2.0 REU, 2) increase the number of buried unsatisfied hydrogen bonds at the interface, or 3) decrease the shape complementarity of the interface by more than 0.02. During the second part of the optimization algorithm, reversions that passed the first part can be combined one at a time proceeding from the best scoring to the worst scoring individual mutations, only accepting those that still pass the three criteria above in the context of all previously accepted mutations. Then, optimization can be terminated if a mutation passes these criteria but causes the predicted binding energy to be greater than a user-defined threshold (in one example, −15 REU was used for T32 designs and −17 REU for T33 designs) or the shape complementarity to be less than 0.65. During both parts of optimization, all interface positions can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation. Furthermore, during the second part, the reference structure for measuring the change in shape complementarity can be reset after each accepted mutation.
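The two-part greedy reversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Rosetta implementation: the `Metrics` record, the `evaluate` callback, the position labels, and the toy additive scoring model are all hypothetical. More negative binding energy is assumed to be more favorable, so that "decreasing the predicted binding energy" is read as the value rising by more than 2.0 REU. The sketch also assumes that a reversion that triggers termination is not itself accepted, which the text does not specify.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Tuple


class Metrics(NamedTuple):
    binding_energy: float  # REU; more negative assumed more favorable
    buried_unsat: int      # buried unsatisfied hydrogen bonds at the interface
    shape_comp: float      # interface shape complementarity


def acceptable(base: Metrics, sc_ref: float, new: Metrics) -> bool:
    """The three criteria; only the shape complementarity reference moves."""
    return (new.binding_energy - base.binding_energy <= 2.0
            and new.buried_unsat <= base.buried_unsat
            and sc_ref - new.shape_comp <= 0.02)


def greedy_revert(positions: List[str],
                  evaluate: Callable[[FrozenSet[str]], Metrics],
                  energy_cutoff: float = -15.0,
                  sc_cutoff: float = 0.65) -> List[str]:
    base = evaluate(frozenset())
    # Part 1: test each reversion alone and rank by change in shape complementarity.
    singles: List[Tuple[float, str]] = []
    for p in positions:
        m = evaluate(frozenset([p]))
        if acceptable(base, base.shape_comp, m):
            singles.append((m.shape_comp - base.shape_comp, p))
    singles.sort(key=lambda t: t[0], reverse=True)
    # Part 2: combine one at a time, best first, resetting the reference
    # for shape complementarity after each accepted reversion.
    accepted: FrozenSet[str] = frozenset()
    sc_ref = base.shape_comp
    for _, p in singles:
        trial = accepted | {p}
        m = evaluate(trial)
        if not acceptable(base, sc_ref, m):
            continue
        if m.binding_energy > energy_cutoff or m.shape_comp < sc_cutoff:
            break  # terminate; this reversion is not kept (assumption)
        accepted, sc_ref = trial, m.shape_comp
    return sorted(accepted)


# Toy additive model standing in for a real structure evaluation.
EFFECTS: Dict[str, Tuple[float, int, float]] = {
    "A12": (0.5, 0, 0.01), "B40": (3.0, 0, 0.0),
    "A77": (1.0, 0, -0.01), "B5": (1.0, 0, -0.015),
}


def toy_evaluate(reverted: FrozenSet[str]) -> Metrics:
    e = -30.0 + sum(EFFECTS[p][0] for p in reverted)
    u = sum(EFFECTS[p][1] for p in reverted)
    s = 0.70 + sum(EFFECTS[p][2] for p in reverted)
    return Metrics(e, u, s)


reverted = greedy_revert(list(EFFECTS), toy_evaluate)
```

In this toy run, the reversion at "B40" fails the individual test, and "B5" passes individually but is rejected in combination because the cumulative loss in predicted binding energy exceeds 2.0 REU.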

Following at least one round of rigid body and side chain minimization, side chain repacking, and minimization, the full suite of additional metrics can be evaluated (as outlined at the end of Stage II) with the additional calculation of a Boltzmann weighted estimation of the probability of each designed side chain configuration in the bound versus the unbound state. For each design, the values of the final rigid body DOFs are output to a score file along with the additional metrics and the standard score12 score terms, and a resfile is generated containing the interface positions and their amino acid identities.
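As an illustration of a Boltzmann-weighted probability of this kind, the following sketch computes the probability of a designed side chain conformation from a list of conformer energies in a given state. It is a generic textbook calculation rather than the specific Rosetta procedure: the function name, the sample energies, and the temperature factor `kT = 0.8` are hypothetical values chosen only for the example.

```python
import math
from typing import Sequence


def boltzmann_probability(energies: Sequence[float],
                          designed_index: int,
                          kT: float = 0.8) -> float:
    """Probability of the designed conformer among all listed conformers.

    p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)
    Energies are shifted by their minimum for numerical stability; the
    shift cancels in the ratio.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    return weights[designed_index] / sum(weights)


# Made-up conformer energies (REU) for one interface side chain.
bound_energies = [-3.0, -1.0, 0.5]     # designed conformer strongly favored
unbound_energies = [-1.2, -1.0, -0.9]  # conformers nearly degenerate
p_bound = boltzmann_probability(bound_energies, designed_index=0)
p_unbound = boltzmann_probability(unbound_energies, designed_index=0)
```

Comparing `p_bound` with `p_unbound` indicates whether the designed conformation is favored only in the assembled state or also in the isolated subunit.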

In one example, all 68 T32 designs and 38 T33 designs resulting from Stage III were run through the resfile-based refinement protocols outlined in Stage IV below.

Stage IV:

Stage IV of the design process can involve one or more iterations of resfile-based redesign with user-guided mutations. In each iteration of the process, a combination of visual inspection and analysis of the design metrics can be used to generate modified resfiles for each design, with each modified resfile containing a small number of user-defined mutations relative to a corresponding resfile output from Stage III. Two different protocols, resfile_optimize and resfile_design, can be used to test the user-defined mutations. In both protocols, the starting configuration can be generated from the two input scaffolds using the rigid body DOFs from the previous round of design.

The resfile_optimize protocol uses greedy optimization to test the user-defined mutations. First the reverted design resulting from Stage III can be regenerated using the unmodified resfile output from Stage III together with the standard RosettaDesign™ algorithm, and the side chains specified in the resfile are minimized, repacked, and minimized. Next, user-defined mutations can be tested individually at each design position. Each mutation can be ranked by the change in shape complementarity of the designed interface, if the mutation does not decrease the predicted binding energy by greater than 2.0 REU or decrease the shape complementarity of the designed interface by more than 0.02. The passing mutations are then combined one at a time proceeding from the best ranked to the worst ranked individual mutations, only accepting those that still do not decrease the binding energy by more than 2.0 REU or the shape complementarity by more than 0.02 in the context of all previously accepted mutations. Optimization can be terminated if a mutation passes these criteria, but causes the predicted binding energy to be greater than −15 REU or the shape complementarity to be less than 0.63. All positions specified in the input resfile can be subjected to at least one round of minimization, repacking, and minimization prior to evaluating the effects of each mutation. Furthermore, during the combining stage, the reference structure for measuring the change in predicted binding energy and the change in the shape complementarity can be reset after each accepted mutation.

The resfile_design protocol involves taking the starting design configuration generated using the rigid body DOFs from the previous round of design and applying the standard RosettaDesign algorithm with the user-defined resfile.

In both protocols, the symmetric rigid body DOFs and the side chains specified in the input resfile are minimized, side chains repacked, and minimized prior to calculating the full suite of design metrics. This process can be iterated until designs are obtained which are deemed suitable for experimental testing or until the user decides the designs are no longer worth pursuing.

Example Computing Environment

FIG. 8 is a block diagram of an example computing network. Some or all of the above-mentioned techniques disclosed herein, such as but not limited to techniques disclosed as part of and/or being performed by software, the Rosetta™ software suite, RosettaDesign™, Rosetta™ applications, and/or other herein-described computer software and computer hardware, can be part of and/or performed by a computing device. For example, FIG. 8 shows protein design system 902 configured to communicate, via network 906, with client devices 904 a, 904 b, and 904 c and protein database 908. In some embodiments, protein design system 902 and/or protein database 908 can be a computing device configured to perform some or all of the herein described methods and techniques, such as but not limited to, method 100 and functionality described as being part of or related to the Rosetta™ software suite. Protein database 908 can, in some embodiments, store information related to and/or used by the Rosetta™ software suite.

Network 906 may correspond to a LAN, a wide area network (WAN), a corporate intranet, the public Internet, or any other type of network configured to provide a communications path between networked computing devices. Network 906 may also correspond to a combination of one or more LANs, WANs, corporate intranets, and/or the public Internet.

Although FIG. 8 only shows three client devices 904 a, 904 b, 904 c, distributed application architectures may serve tens, hundreds, or thousands of client devices. Moreover, client devices 904 a, 904 b, 904 c (or any additional client devices) may be any sort of computing device, such as an ordinary laptop computer, desktop computer, network terminal, wireless communication device (e.g., a cell phone or smart phone), and so on. In some embodiments, client devices 904 a, 904 b, 904 c can be dedicated to problem solving/using the Rosetta software suite. In other embodiments, client devices 904 a, 904 b, 904 c can be used as general purpose computers that are configured to perform a number of tasks and need not be dedicated to problem solving. In still other embodiments, part or all of the functionality of protein design system 902 and/or protein database 908 can be incorporated in a client device, such as client device 904 a, 904 b, and/or 904 c.

Computing Device Architecture

FIG. 9A is a block diagram of an example computing device (e.g., system). In particular, computing device 1000 shown in FIG. 9A can be configured to: include components of and/or perform one or more functions of protein design system 902, client device 904 a, 904 b, 904 c, network 906, and/or protein database 908 and/or carry out part or all of any herein-described methods and techniques, such as but not limited to method 100. Computing device 1000 may include a user interface module 1001, a network-communication interface module 1002, one or more processors 1003, and data storage 1004, all of which may be linked together via a system bus, network, or other connection mechanism 1005.

User interface module 1001 can be operable to send data to and/or receive data from external user input/output devices. For example, user interface module 1001 can be configured to send and/or receive data to and/or from user input devices such as a keyboard, a keypad, a touch screen, a computer mouse, a track ball, a joystick, a camera, a voice recognition module, and/or other similar devices. User interface module 1001 can also be configured to provide output to user display devices, such as one or more cathode ray tubes (CRT), liquid crystal displays (LCD), light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices, either now known or later developed. User interface module 1001 can also be configured to generate audible output(s), such as a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices.

Network-communications interface module 1002 can include one or more wireless interfaces 1007 and/or one or more wireline interfaces 1008 that are configurable to communicate via a network, such as network 906 shown in FIG. 8. Wireless interfaces 1007 can include one or more wireless transmitters, receivers, and/or transceivers, such as a Bluetooth transceiver, a Zigbee transceiver, a Wi-Fi transceiver, a WiMAX transceiver, and/or other similar type of wireless transceiver configurable to communicate via a wireless network. Wireline interfaces 1008 can include one or more wireline transmitters, receivers, and/or transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or similar transceiver configurable to communicate via a twisted pair, one or more wires, a coaxial cable, a fiber-optic link, or a similar physical connection to a wireline network.

In some embodiments, network communications interface module 1002 can be configured to provide reliable, secured, and/or authenticated communications. For each communication described herein, information for ensuring reliable communications (i.e., guaranteed message delivery) can be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation header(s) and/or footer(s), size/time information, and transmission verification information such as CRC and/or parity check values). Communications can be made secure (e.g., be encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, DES, AES, RSA, Diffie-Hellman, and/or DSA. Other cryptographic protocols and/or algorithms can be used as well or in addition to those listed herein to secure (and then decrypt/decode) communications.

Processors 1003 can include one or more general purpose processors and/or one or more special purpose processors (e.g., digital signal processors, application specific integrated circuits, etc.). Processors 1003 can be configured to execute computer-readable program instructions 1006 contained in data storage 1004 and/or other instructions as described herein. Data storage 1004 can include one or more computer-readable storage media that can be read and/or accessed by at least one of processors 1003. The one or more computer-readable storage media can include volatile and/or non-volatile storage components, such as optical, magnetic, organic or other memory or disc storage, which can be integrated in whole or in part with at least one of processors 1003. In some embodiments, data storage 1004 can be implemented using a single physical device (e.g., one optical, magnetic, organic or other memory or disc storage unit), while in other embodiments, data storage 1004 can be implemented using two or more physical devices.

Data storage 1004 can include computer-readable program instructions 1006 and perhaps additional data. For example, in some embodiments, data storage 1004 can store part or all of data utilized by a protein design system and/or a protein database; e.g., protein design system 902, protein database 908. In some embodiments, data storage 1004 can additionally include storage required to perform at least part of the herein-described methods and techniques and/or at least part of the functionality of the herein-described devices and networks.

FIG. 9B depicts a network 906 of computing clusters 1009 a, 1009 b, 1009 c arranged as a cloud-based server system in accordance with an example embodiment. Data and/or software for protein design system 902 can be stored on one or more cloud-based devices that store program logic and/or data of cloud-based applications and/or services. In some embodiments, protein design system 902 can be a single computing device residing in a single computing center. In other embodiments, protein design system 902 can include multiple computing devices in a single computing center, or even multiple computing devices located in multiple computing centers located in diverse geographic locations.

In some embodiments, data and/or software for protein design system 902 can be encoded as computer readable information stored in tangible computer readable media (or computer readable storage media) and accessible by client devices 904 a, 904 b, and 904 c, and/or other computing devices. In some embodiments, data and/or software for protein design system 902 can be stored on a single disk drive or other tangible storage media, or can be implemented on multiple disk drives or other tangible storage media located at one or more diverse geographic locations.

FIG. 9B depicts a cloud-based server system in accordance with an example embodiment. In FIG. 9B, the functions of protein design system 902 can be distributed among three computing clusters 1009 a, 1009 b, and 1009 c. Computing cluster 1009 a can include one or more computing devices 1000 a, cluster storage arrays 1010 a, and cluster routers 1011 a connected by a local cluster network 1012 a. Similarly, computing cluster 1009 b can include one or more computing devices 1000 b, cluster storage arrays 1010 b, and cluster routers 1011 b connected by a local cluster network 1012 b. Likewise, computing cluster 1009 c can include one or more computing devices 1000 c, cluster storage arrays 1010 c, and cluster routers 1011 c connected by a local cluster network 1012 c.

In some embodiments, each of the computing clusters 1009 a, 1009 b, and 1009 c can have an equal number of computing devices, an equal number of cluster storage arrays, and an equal number of cluster routers. In other embodiments, however, each computing cluster can have different numbers of computing devices, different numbers of cluster storage arrays, and different numbers of cluster routers. The number of computing devices, cluster storage arrays, and cluster routers in each computing cluster can depend on the computing task or tasks assigned to each computing cluster.

In computing cluster 1009 a, for example, computing devices 1000 a can be configured to perform various computing tasks of protein design system 902. In one embodiment, the various functionalities of protein design system 902 can be distributed among one or more of computing devices 1000 a, 1000 b, and 1000 c. Computing devices 1000 b and 1000 c in computing clusters 1009 b and 1009 c can be configured similarly to computing devices 1000 a in computing cluster 1009 a. On the other hand, in some embodiments, computing devices 1000 a, 1000 b, and 1000 c can be configured to perform different functions.

In some embodiments, computing tasks and stored data associated with protein design system 902 can be distributed across computing devices 1000 a, 1000 b, and 1000 c based at least in part on the processing requirements of protein design system 902, the processing capabilities of computing devices 1000 a, 1000 b, and 1000 c, the latency of the network links between the computing devices in each computing cluster and between the computing clusters themselves, and/or other factors that can contribute to the cost, speed, fault-tolerance, resiliency, efficiency, and/or other design goals of the overall system architecture.

The cluster storage arrays 1010 a, 1010 b, and 1010 c of the computing clusters 1009 a, 1009 b, and 1009 c can be data storage arrays that include disk array controllers configured to manage read and write access to groups of hard disk drives. The disk array controllers, alone or in conjunction with their respective computing devices, can also be configured to manage backup or redundant copies of the data stored in the cluster storage arrays to protect against disk drive or other cluster storage array failures and/or network failures that prevent one or more computing devices from accessing one or more cluster storage arrays.

Similar to the manner in which the functions of protein design system 902 can be distributed across computing devices 1000 a, 1000 b, and 1000 c of computing clusters 1009 a, 1009 b, and 1009 c, various active portions and/or backup portions of these components can be distributed across cluster storage arrays 1010 a, 1010 b, and 1010 c. For example, some cluster storage arrays can be configured to store one portion of the data and/or software of protein design system 902, while other cluster storage arrays can store a separate portion of the data and/or software of protein design system 902. Additionally, some cluster storage arrays can be configured to store backup versions of data stored in other cluster storage arrays.

The cluster routers 1011 a, 1011 b, and 1011 c in computing clusters 1009 a, 1009 b, and 1009 c can include networking equipment configured to provide internal and external communications for the computing clusters. For example, the cluster routers 1011 a in computing cluster 1009 a can include one or more internet switching and routing devices configured to provide (i) local area network communications between the computing devices 1000 a and the cluster storage arrays 1010 a via the local cluster network 1012 a, and (ii) wide area network communications between the computing cluster 1009 a and the computing clusters 1009 b and 1009 c via the wide area network connection 1013 a to network 906. Cluster routers 1011 b and 1011 c can include network equipment similar to the cluster routers 1011 a, and cluster routers 1011 b and 1011 c can perform similar networking functions for computing clusters 1009 b and 1009 c that cluster routers 1011 a perform for computing cluster 1009 a.

In some embodiments, the configuration of the cluster routers 1011 a, 1011 b, and 1011 c can be based at least in part on the data communication requirements of the computing devices and cluster storage arrays, the data communications capabilities of the network equipment in the cluster routers 1011 a, 1011 b, and 1011 c, the latency and throughput of local networks 1012 a, 1012 b, 1012 c, the latency, throughput, and cost of wide area network links 1013 a, 1013 b, and 1013 c, and/or other factors that can contribute to the cost, speed, fault-tolerance, resiliency, efficiency and/or other design goals of the system architecture.

Nanostructures and Proteins

The present invention provides synthetic nanostructures comprising

(a) a plurality of first proteins that self-interact to form a first multimeric substructure comprising at least one axis of rotational symmetry;

(b) a plurality of second proteins that self-interact to form a second multimeric substructure comprising at least one axis of rotational symmetry;

wherein multiple copies of the first multimeric substructure and the second multimeric substructure interact with each other at symmetrically repeated, non-natural, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group.

The nanostructures of the invention can be used for any suitable purpose, including but not limited to delivery vehicles, as the nanostructures can encapsulate molecules of interest and/or the first and second proteins can be modified to bind to molecules of interest (diagnostics, therapeutics, detectable molecules for imaging and other applications, etc.).

The nanostructures of the invention are synthetic, in that they are not naturally occurring. The first protein and the second protein are non-naturally occurring proteins that can be produced by any suitable means, including recombinant production or chemical synthesis. Each member of the plurality of first proteins is identical to each other, and each member of the plurality of second proteins is identical to each other. The first proteins and the second proteins are different. There are no specific primary amino acid sequence requirements for the first and second proteins. As described in detail herein, the inventors disclose methods for designing the synthetic nanostructures of the invention, where the nanostructures are not dependent on specific primary amino acid sequences of the first and second proteins that make up the multimeric structures that interact to form the nanostructures of the invention. As will be understood by those of skill in the art, the design methods of the invention can produce a wide variety of nanostructures made of a wide variety of subunit proteins, and the methods are in no way limited to the subunit proteins disclosed herein.

As used herein, a “plurality” means at least two; in various embodiments, there are at least 2, 3, 4, 5, 6 or more first proteins in the first multimeric substructure and second proteins in the second multimeric substructure.

The number of first proteins in the first multimeric substructure may be the same as or different from the number of second proteins in the second multimeric substructure. In one exemplary embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a dimer of the second protein. In a further exemplary embodiment, the first multimeric substructure comprises a trimer of the first protein, and the second multimeric substructure comprises a trimer of the second protein.

The first and second proteins may be of any suitable length for a given purpose of the resulting nanostructure. In one embodiment, the first protein and the second protein are typically between 30-250 amino acids in length; the length of the first protein and the second protein may be the same or different. In various further embodiments, the first protein and the second protein are between 30-225, 30-200, 30-175, 50-250, 50-225, 50-200, 50-175, 75-250, 75-225, 75-200, 75-175, 100-250, 100-225, 100-200, 100-175, 125-250, 125-225, 125-200, 125-175, 150-250, 150-225, 150-200, and 150-175 amino acids in length.

In another embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11) and T32-28B (SEQ ID NO: 12);

(b) T33-09A (SEQ ID NO: 13) and T33-09B (SEQ ID NO: 14);

(c) T33-15A (SEQ ID NO: 15) and T33-15B (SEQ ID NO: 16);

(d) T33-21A (SEQ ID NO: 17) and T33-21B (SEQ ID NO: 18); and

(e) T33-28A (SEQ ID NO: 19) and T33-28B (SEQ ID NO: 20).

FIGS. 10-19 show the primary amino acid sequences of the proteins noted and allowable substitutions. Each figure includes four columns, which show:

1) The residue position in the protein

2) The identity of that residue in the designed sequence

3) The allowed amino acids at that position within our genus (labeled 1-4, indicating the AAs at that position in the different SEQ ID NOs for the relevant protein); and

4) The solvent-accessible surface area (SASA) of that residue in crystal structures (T32-28, T33-15, T33-21, and T33-28) or computationally designed models (T33-09) of the nanostructures.

In some embodiments certain residues can be any amino acid residue (“any”); such residues with a solvent-accessible surface area of greater than 50 square Angstroms are defined as being present on the polypeptide surface, and thus can be substituted with a different amino acid as desired for a given purpose without disruption of protein structure or multimer assembly (for example, SEQ ID NOS:11-20). In various other embodiments, these same residues can be modified by conservative substitutions (for example, SEQ ID NOS:21-30).

As further shown in the table, certain other residues can only be substituted with conservative amino acid substitutions. Such residues have a solvent-accessible surface area of less than or equal to 50 square Angstroms and are present in the polypeptide interior, and thus can be modified only by conservative substitutions to maintain overall protein structure to permit multimer assembly. As used here, “conservative amino acid substitution” means that:

- hydrophobic amino acids (Ala, Cys, Gly, Pro, Met, Sce, Sme, Val, Ile, Leu) can only be substituted with other hydrophobic amino acids;

- hydrophobic amino acids with bulky side chains (Phe, Tyr, Trp) can only be substituted with other hydrophobic amino acids with bulky side chains;

- amino acids with positively charged side chains (Arg, His, Lys) can only be substituted with other amino acids with positively charged side chains;

- amino acids with negatively charged side chains (Asp, Glu) can only be substituted with other amino acids with negatively charged side chains; and

- amino acids with polar uncharged side chains (Ser, Thr, Asn, Gln) can only be substituted with other amino acids with polar uncharged side chains.
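For illustration, the substitution rules above can be encoded as a lookup over the five listed classes, combined with the 50 square Angstrom solvent-accessible surface area threshold. The function and variable names are hypothetical, three-letter residue codes are used because Sce and Sme have no one-letter abbreviation in this document, and the sketch does not handle the invariant interface residues.

```python
from typing import Dict, FrozenSet, List

# The five classes exactly as listed in the definition above.
CLASSES: List[FrozenSet[str]] = [
    frozenset({"Ala", "Cys", "Gly", "Pro", "Met", "Sce", "Sme", "Val", "Ile", "Leu"}),
    frozenset({"Phe", "Tyr", "Trp"}),
    frozenset({"Arg", "His", "Lys"}),
    frozenset({"Asp", "Glu"}),
    frozenset({"Ser", "Thr", "Asn", "Gln"}),
]
CLASS_OF: Dict[str, int] = {aa: i for i, group in enumerate(CLASSES) for aa in group}

SURFACE_SASA_CUTOFF = 50.0  # square Angstroms, from the text


def is_conservative(original: str, replacement: str) -> bool:
    """True if both residues fall in the same class listed above."""
    return CLASS_OF[original] == CLASS_OF[replacement]


def substitution_allowed(original: str, replacement: str, sasa: float) -> bool:
    """Surface residues (SASA > 50) may be any amino acid; interior
    residues (SASA <= 50) only admit conservative substitutions."""
    if sasa > SURFACE_SASA_CUTOFF:
        return True
    return is_conservative(original, replacement)
```

For example, `substitution_allowed("Ala", "Asp", sasa=60.0)` is permitted because the residue is treated as a surface position, whereas the same substitution at a position with a SASA of 50.0 or less is not.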

Certain other residues in the proteins are invariant; these residues have one or more atoms within 5 Angstroms of one or more atoms across the interface between the first and second multimeric substructures, and are therefore directly involved in self-assembly.
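The 5 Angstrom criterion for invariant residues can be illustrated with a brute-force distance check between the atoms of two substructures. This is a sketch only: the residue labels, the coordinate dictionaries, and the function name are hypothetical, and in practice the coordinates would be taken from a crystal structure or design model.

```python
import math
from typing import Dict, List, Set, Tuple

Atom = Tuple[float, float, float]   # x, y, z in Angstroms
Residues = Dict[str, List[Atom]]    # residue label -> atom coordinates


def interface_residues(first: Residues, second: Residues,
                       cutoff: float = 5.0) -> Tuple[Set[str], Set[str]]:
    """Residues having at least one atom within `cutoff` of any atom
    on the other side of the interface."""
    first_hits: Set[str] = set()
    second_hits: Set[str] = set()
    for res_a, atoms_a in first.items():
        for res_b, atoms_b in second.items():
            if any(math.dist(a, b) <= cutoff for a in atoms_a for b in atoms_b):
                first_hits.add(res_a)
                second_hits.add(res_b)
    return first_hits, second_hits


# Made-up coordinates for illustration only.
first_sub: Residues = {
    "A10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "A50": [(20.0, 0.0, 0.0)],
}
second_sub: Residues = {
    "B7": [(5.5, 0.0, 0.0)],
    "B90": [(0.0, 30.0, 0.0)],
}
invariant_first, invariant_second = interface_residues(first_sub, second_sub)
```

Here "A10" and "B7" are flagged because one pair of their atoms lies 4.0 Angstroms apart; the all-pairs loop is adequate for a sketch but a spatial index would be preferable for full-size assemblies.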

As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gln; Q), glycine (Gly; G), histidine (His; H), isoleucine (Ile; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Trp; W), tyrosine (Tyr; Y), and valine (Val; V).

In a further embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 21) and T32-28B (SEQ ID NO: 22);

(b) T33-09A (SEQ ID NO: 23) and T33-09B (SEQ ID NO: 24);

(c) T33-15A (SEQ ID NO: 25) and T33-15B (SEQ ID NO: 26);

(d) T33-21A (SEQ ID NO: 27) and T33-21B (SEQ ID NO: 28); and

(e) T33-28A (SEQ ID NO: 29) and T33-28B (SEQ ID NO: 30).

In another embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 31) and T32-28B (SEQ ID NO: 32);

(b) T33-09A (SEQ ID NO: 33) and T33-09B (SEQ ID NO: 34);

(c) T33-15A (SEQ ID NO: 35) and T33-15B (SEQ ID NO: 36);

(d) T33-21A (SEQ ID NO: 37) and T33-21B (SEQ ID NO: 38); and

(e) T33-28A (SEQ ID NO: 39) and T33-28B (SEQ ID NO: 40).

In one embodiment, the first protein and the second protein comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 11, 21, or 31) and T32-28B (SEQ ID NO: 12, 22, or 32), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

(b) T33-09A (SEQ ID NO: 13, 23, or 33) and T33-09B (SEQ ID NO: 14, 24, or 34), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 3 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 4;

(c) T33-15A (SEQ ID NO: 15, 25, or 35) and T33-15B (SEQ ID NO: 16, 26, or 36), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 5 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 6;

(d) T33-21A (SEQ ID NO: 17, 27, or 37) and T33-21B (SEQ ID NO: 18, 28, or 38), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 7 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 8; and

(e) T33-28A (SEQ ID NO: 19, 29, or 39) and T33-28B (SEQ ID NO: 20, 30, or 40), wherein the first protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 9 and the second protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 10.

In various further embodiments, the first and second proteins are at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identical to the amino acid sequence of the designed protein.
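As one hedged illustration of how a percent identity figure of this kind might be computed, the sketch below counts matching positions between two sequences that are assumed to be already aligned to equal length, with "-" marking gaps. The function name and the example strings are hypothetical, and dividing by the full alignment length is only one of several conventions in use; the document does not specify which alignment method or denominator applies.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical, non-gap residues.

    Assumes the two sequences were aligned beforehand and therefore
    have the same length; gap characters never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Illustrative ten-residue fragments differing at three positions.
reference = "MGEVPIGDPK"
variant = "MGEAPIGEPR"
identity = percent_identity(reference, variant)
meets_70 = identity >= 70.0
```

With seven of ten positions matching, the variant fragment sits exactly at the 70% threshold used in the embodiments above.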

In various further embodiments, the first and second proteins comprise or consist of proteins selected from the following pairs of first and second proteins:

(a) T32-28A (SEQ ID NO: 1) MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHAL KNNPNGFPEGFWMPYLTIAYALANADTGAIKTGTLMPMVADDGPHYGANI AMEKDKKGGFGVGTYALTFLISNPEKQGFGRHVDEETGVGKWFEPFVVTY FFKYTGTPK; and T32-28B (SEQ ID NO: 2) MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTISPGKFLLMLGGDI GAIQQAIETGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIV ETWSVAACISAADLAVKGSNVTLVRVHMAFGIGGKCYMVVAGDVLDVAAA VATASLAAGAKGLLVYASIIPRPHEAMWRQMVEG;

(b) T33-09A (SEQ ID NO: 3) MEEVVLITVPSALVAVKIAHALVEERLAACVNIVPGLTSIYRWQGSVVSD HELLLLVKTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRE NTG; and T33-09B (SEQ ID NO: 4) MVRGIRGAITVEEDTPAAILAATIELLLKMLEANGIQSYEELAAVIFTVT EDLTSAFPAEAARLIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQ DRVRHVYLNEAVRLRPDLESAQ;

(c) T33-15A (SEQ ID NO: 5) MSKAKIGIVTVSDRASAGITADISGKAIILALNLYLTSEWEPIYQVIPDE QDVIETTLIKMADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFG ELMRAESLKEVPTAILSRQTAGLRGDSLIVNLPGDPASISDCLLAVFPAI PYCIDLMEGPYLECNEAMIKPFRPKAK; and T33-15B (SEQ ID NO: 6) MVRGIRGAITVNSDTPTSIIIATILLLEKMLEANGIQSYEELAAVIFTVT EDLTSAFPAEAARQIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQ DRVRHVYLSEAVRLRPDLESAQ;

(d) T33-21A (SEQ ID NO: 7) MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDE EMKGILEEIQNDIYKIMGEIGSKGKIEGISEERIAWLLKLILRYMEMVNL KSFVLPGGTLESAKLDVCRTIARRALRKVLTVTREFGIGAEAAAYLLALS DLLFLLARVIEIEKNKLKEVRS; and T33-21B (SEQ ID NO: 8) MPHLVIEATANLRLETSPGELLEQANKALFASGQFGEADIKSRFVTLEAY RQGTAAVERAYLHACLSILDGRDIATRTLLGASLCAVLAEAVAGGGEEGV QVSVEVREMERLSYAKRVVARQR; and

(e) T33-28A (SEQ ID NO: 9) MESVNTSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMIG VRAAQIFLGDDTEDGFKGPHIRIRCVDIDDKHTYNAMVYVDLIVGTGASE VERETAEEEAKLALRVALQVDIADEHSCVTQFEMKLREELLSSDSFHPDK DEYYKDFL; and T33-28B (SEQ ID NO: 10) MPVIQTFVSTPLDHHKRLLLAIITYRIVTRVVLGKPEDLVMMTFHDSTPM HFFGSTDPVACVRVEALGGYGPSEPEKVTSIVTAAITAVCGIVADRIFVL YFSPLHCGWNGTNF.

As shown in the examples that follow, these non-naturally occurring protein pairs self-interact to form multimeric substructures, which can interact to form the nanostructures of the invention. As will be understood by those of skill in the art, the design methods of the invention can produce a wide variety of nanostructures made of a wide variety of subunit proteins, and the methods are in no way limited to these particular protein pairs; they are merely exemplary.

The plurality of the first proteins self-interact to form a first multimeric substructure and the plurality of the second proteins self-interact to form a second multimeric substructure, where each multimeric substructure comprises at least one axis of rotational symmetry. As will be understood by those of skill in the art, the self-interaction is a non-covalent protein-protein interaction. Any suitable non-covalent interaction(s) can drive self-interaction of the proteins to form the multimeric substructure, including but not limited to one or more of electrostatic interactions, π-effects, van der Waals forces, hydrogen bonding, and hydrophobic effects. The self-interaction in each of the two different multimeric substructures may be natural or synthetic in origin; that is, the synthetic proteins making up the nanostructures of the invention may be synthetic variations of natural proteins that self-interact to form multimeric substructures, or they may be fully synthetic proteins that have no amino acid sequence relationships to known natural proteins.

As used herein, “at least one axis of rotational symmetry” means at least one axis of symmetry around which the substructure can be rotated without changing the appearance of the substructure. In one embodiment, one or both of the substructures have cyclic symmetry, meaning rotation about a single axis (for example, a three-fold axis in the case of a trimeric protein; generally, multimeric substructures with n subunits and cyclic symmetry will have n-fold rotational symmetry, sometimes denoted as C_(n) symmetry). In other embodiments, one or both substructures possess symmetries comprising multiple rotational symmetry axes, including but not limited to dihedral symmetry (cyclic symmetry plus an orthogonal two-fold rotational axis) and the cubic point group symmetries including tetrahedral, octahedral, and icosahedral point group symmetry (multiple kinds of rotational axes). The first multimeric substructure and the second multimeric substructure may comprise the same or different rotational symmetry properties. In one non-limiting embodiment, the first multimeric substructure comprises a dimer, trimer, tetramer, or pentamer of the first protein, and the second multimeric substructure comprises a dimer or trimer of the second protein. In a further non-limiting embodiment, the first multimeric protein comprises a trimeric protein, and the second multimeric protein comprises a dimeric protein. In another non-limiting embodiment, the first multimeric protein comprises a trimeric protein, and the second multimeric protein comprises a different trimeric protein.

In the nanostructures of the invention, there are at least two identical copies of the first multimeric substructure and at least two identical copies of the second multimeric substructure in the nanostructure. In general, the number of copies of each of the first and second multimeric substructures is dictated by the number of symmetry axes in the designated mathematical symmetry group of the nanostructure that match the symmetry axes in each multimeric substructure. This relationship arises from the requirement that the symmetry axes of each copy of each multimeric substructure must be aligned to symmetry axes of the same kind in the synthetic nanostructure. By way of non-limiting example, a synthetic nanostructure with tetrahedral point group symmetry can comprise exactly four copies of a first trimeric substructure aligned along the exactly four three-fold symmetry axes passing through the center and vertices of a tetrahedron. Likewise, the same non-limiting example tetrahedral nanostructure can comprise six (but not five, seven, or any other number) copies of a dimeric substructure aligned along the six two-fold symmetry axes passing through the center and edges of the tetrahedron (an example of a synthetic nanostructure with this symmetric architecture, referred to here as T32, is shown in FIG. 3F). In general, although every copy of each multimeric substructure must have its symmetry axes aligned to symmetry axes of the same kind in the synthetic nanostructure, not all symmetry axes in the synthetic nanostructure must have a multimeric building block aligned to them. By way of non-limiting example, we can consider a synthetic nanostructure with icosahedral point group symmetry comprising multiple copies of each of a first multimeric substructure and a second multimeric substructure. There are 30 two-fold, 20 three-fold, and 12 five-fold rotational symmetry axes in icosahedral point group symmetry. 
The nanostructures of the invention are those in which two different multimeric substructures are aligned along all instances of two types of symmetry axes in a designated mathematical symmetry group. Therefore, the nanostructures in this non-limiting example could include icosahedral nanostructures comprising 30 dimeric substructures and 20 trimeric substructures, or 30 dimeric substructures and 12 pentameric substructures, or 20 trimeric substructures and 12 pentameric substructures. In each case, one of the three types of symmetry axes is left unoccupied by multimeric substructures.
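The copy numbers quoted above (four trimers and six dimers for tetrahedral symmetry; 30 dimers, 20 trimers, and 12 pentamers for icosahedral symmetry) equal the order of the rotational point group divided by the fold of the axis. The following is a minimal sketch of that arithmetic, counting axes the same way the text does. The function and dictionary names are hypothetical and are not part of the disclosed design method.

```python
# Illustrative sketch only; names are hypothetical, not from the disclosure.
# Orders of the rotational point groups (standard group-theory values).
ROTATION_GROUP_ORDER = {"tetrahedral": 12, "octahedral": 24, "icosahedral": 60}

# Folds of the rotational symmetry axes present in each point group.
AXIS_FOLDS = {
    "tetrahedral": {2, 3},
    "octahedral": {2, 3, 4},
    "icosahedral": {2, 3, 5},
}


def copies_required(group: str, n_fold: int) -> int:
    """Number of n-fold multimeric substructures occupying all n-fold axes."""
    if n_fold not in AXIS_FOLDS[group]:
        raise ValueError(f"{group} symmetry has no {n_fold}-fold axis")
    return ROTATION_GROUP_ORDER[group] // n_fold


def total_subunits(group: str, first_fold: int, second_fold: int) -> int:
    """Total protein chains in a two-component nanostructure."""
    first = first_fold * copies_required(group, first_fold)
    second = second_fold * copies_required(group, second_fold)
    return first + second


if __name__ == "__main__":
    print(copies_required("tetrahedral", 3))    # trimers in a tetrahedral structure
    print(copies_required("tetrahedral", 2))    # dimers in a tetrahedral structure
    print(total_subunits("tetrahedral", 3, 2))  # chains in a T32-type architecture
```

Under this counting, both a trimer-plus-dimer and a trimer-plus-trimer tetrahedral architecture contain 24 protein chains in total.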

The interaction between the first and second multimeric substructures is a non-natural (e.g., not an interaction seen in a naturally occurring protein multimer), non-covalent interaction; this can comprise any suitable non-covalent interaction(s), including but not limited to one or more of electrostatic interactions, π-effects, van der Waals forces, hydrogen bonding, and hydrophobic effects. The interaction occurs at multiple identical interfaces (symmetrical) between the first and second multimeric substructures, wherein the interfaces can be continuous or discontinuous. This symmetric repetition of the non-covalent protein-protein interfaces between the first and second multimeric substructures results from the overall symmetry of the subject nanostructures; because each protein molecule of each of the first and second multimeric substructures is in a symmetrically equivalent position in the nanostructure, the interactions between them are also symmetrically equivalent.

Non-covalent interactions between the first multimeric substructures and the second multimeric substructures orient the substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group as described above. This feature provides for the formation of regular, defined nanostructures, as opposed to irregular or imprecisely defined structures or aggregates. Several structural features of the non-covalent interactions between the first multimeric substructures and the second multimeric substructures help to provide a specific orientation between substructures. Generally, large interfaces that are complementary both chemically and geometrically and comprise many individually weak atomic interactions tend to provide highly specific orientations between protein molecules. In one embodiment of the subject invention, therefore, each symmetrically repeated instance of the non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure may bury between 1000-2000 Å² of solvent-accessible surface area (SASA) on the first multimeric substructure and the second multimeric substructure combined. SASA is a standard measurement of the surface area of molecules commonly used by those skilled in the art; many computer programs exist that can calculate both SASA and the change in SASA upon burial of a given interface for a given protein structure. A commonly used measure of the geometrical complementarity of protein-protein interfaces is the Shape Complementarity (S_(c)) value of Lawrence and Colman (J. Mol. Biol. 234:946-50 (1993)). In a further embodiment, each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure has an S_(c) value between 0.5-0.8. 
Finally, in order to provide a specific orientation between the first multimeric substructures and the second multimeric substructures, in many embodiments the interface between them may be formed by relatively rigid portions of each of the protein substructures. This feature ensures that flexibility within each protein molecule does not lead to imprecisely defined orientations between the first and second multimeric substructures. Secondary structures in proteins, that is alpha helices and beta strands, generally make a large number of atomic interactions with the rest of the protein structure and therefore occupy a rigidly fixed position. Therefore, in one embodiment, at least 50% of the atomic contacts comprising each symmetrically repeated, non-natural, non-covalent protein-protein interface between the first multimeric substructure and the second multimeric substructure are formed from amino acid residues residing in elements of alpha helix and/or beta strand secondary structure.
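The three interface properties described above (buried SASA, S_(c) value, and fraction of contacts contributed by secondary structure) can be checked against the stated ranges once they have been computed by an external program. The sketch below is illustrative only: the class and field names are hypothetical, the ranges are treated as inclusive (an assumption), and each criterion is reported separately because the text presents them as separate embodiments.

```python
# Illustrative sketch only; names are hypothetical, not from the disclosure.
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceMetrics:
    buried_sasa_A2: float            # combined buried SASA per interface, in square angstroms
    shape_complementarity: float     # S_c value of Lawrence and Colman
    secondary_structure_fraction: float  # fraction of contacts from helix/strand residues


def check_interface(m: InterfaceMetrics) -> dict:
    """Report which of the stated ranges a designed interface falls within.

    Range endpoints are assumed to be inclusive.
    """
    return {
        "buried_sasa": 1000.0 <= m.buried_sasa_A2 <= 2000.0,
        "shape_complementarity": 0.5 <= m.shape_complementarity <= 0.8,
        "secondary_structure": m.secondary_structure_fraction >= 0.5,
    }


if __name__ == "__main__":
    # Made-up values for demonstration; not measurements from the examples.
    example = InterfaceMetrics(1450.0, 0.66, 0.72)
    print(check_interface(example))
```

The metric values themselves would come from whatever SASA and shape-complementarity software the practitioner already uses; this sketch only compares them to the ranges.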

The nanostructures of the invention are capable of forming a variety of different structural classes based on the designated mathematical symmetry group of each nanostructure. As the teachings above indicate, the nanostructures comprise multiple copies of a first multimeric substructure and multiple copies of a second multimeric substructure that interact at one or more symmetrically repeated, non-covalent protein-protein interfaces that orient the first multimeric substructures and the second multimeric substructures such that their symmetry axes are aligned with symmetry axes of the same kind in a designated mathematical symmetry group. There are many symmetry groups that comprise multiple types of symmetry axes, including but not limited to dihedral symmetries, cubic point group symmetries, line or helical symmetries, plane or layer symmetries, and space group symmetries. Collectively, the nanostructures of the invention may possess any symmetry that comprises at least two types of symmetry axes; however, each individual nanostructure possesses a single, mathematically defined symmetry that results from the interface between the first and second multimeric substructures orienting them such that their symmetry axes align to those in a designated mathematical symmetry group. Individual nanostructures possessing different symmetries may find use in different applications; for instance, nanostructures possessing cubic point group symmetries may form hollow shell- or cage-like structures that could be useful, for example, for packaging or encapsulating molecules of interest, while nanostructures possessing plane group symmetries will tend to form regularly repeating two-dimensional protein layers that could be used, for example, to array molecules, nanostructures, or other functional elements of interest at regular intervals.

In one embodiment, the mathematical symmetry group is selected from the group consisting of tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry.

As will be apparent to those of skill in the art, the ability to widely modify surface amino acid residues without disruption of the protein structure permits many types of modifications to endow the resulting self-assembled multimers with a variety of functions. In one non-limiting embodiment, the protein can be modified to facilitate covalent linkage to a “cargo” of interest. In one non-limiting example, the protein can be modified, such as by introduction of various cysteine residues at defined positions to facilitate linkage to one or more antigens of interest, such that an assembly of the protein would provide a scaffold to provide a large number of antigens for delivery as a vaccine to generate an improved immune response (similar to the use of virus-like particles). In another non-limiting embodiment, the protein of the invention may be modified by linkage (covalent or non-covalent) with a moiety to help facilitate “endosomal escape.” For applications that involve delivering molecules of interest to a target cell, such as targeted delivery, a critical step can be escape from the endosome—a membrane-bound organelle that is the entry point of the delivery vehicle into the cell. Endosomes mature into lysosomes, which degrade their contents. Thus, if the delivery vehicle does not somehow “escape” from the endosome before it becomes a lysosome, it will be degraded and will not perform its function. There are a variety of lipids or organic polymers that disrupt the endosome and allow escape into the cytosol. Thus, in this embodiment, the first or second protein can be modified, for example, by introducing cysteine residues that will allow chemical conjugation of such a lipid or organic polymer to the monomer or resulting multimer surface.

In a further aspect, the present invention provides isolated proteins, comprising or consisting of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 11);

(b) T32-28B (SEQ ID NO: 12);

(c) T33-09A (SEQ ID NO: 13);

(d) T33-09B (SEQ ID NO: 14);

(e) T33-15A (SEQ ID NO: 15);

(f) T33-15B (SEQ ID NO: 16);

(g) T33-21A (SEQ ID NO: 17);

(h) T33-21B (SEQ ID NO: 18);

(i) T33-28A (SEQ ID NO: 19); and

(j) T33-28B (SEQ ID NO: 20).

The isolated proteins of the invention can be used, for example, to prepare the nanostructures of the invention. In some embodiments, the isolated proteins may be produced in the same time and place; for instance, they may be expressed recombinantly in the same bacterial or eukaryotic cell. In other embodiments, each protein may be produced separately from the other, either by recombinant expression in separate bacterial or eukaryotic cells or by protein synthesis in separate vessels. The isolated proteins of the invention can be modified in a number of ways, including but not limited to the ways described above, either before or after assembly of the nanostructures of the invention. As a non-limiting example, the T33-15A protein and the T33-15B protein could be produced by recombinant expression in separate cultures of bacterial cells and purified independently of one another. Prior to mixing the two proteins, each protein could be modified chemically to introduce additional functionality as described above. The modified proteins could then be mixed to initiate assembly of a modified T33-15 nanostructure that comprises multiple copies of each of the T33-15A and T33-15B proteins. Alternatively, the T33-15A and T33-15B proteins could be produced recombinantly in the same cell to produce the assembled T33-15 nanostructure of the invention, which could then be modified as desired.

FIGS. 10-19 show the primary amino acid sequences of the proteins noted and allowable substitutions, as discussed above. In another embodiment, the isolated proteins comprise or consist of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 21);

(b) T32-28B (SEQ ID NO: 22);

(c) T33-09A (SEQ ID NO: 23);

(d) T33-09B (SEQ ID NO: 24);

(e) T33-15A (SEQ ID NO: 25);

(f) T33-15B (SEQ ID NO: 26);

(g) T33-21A (SEQ ID NO: 27);

(h) T33-21B (SEQ ID NO: 28);

(i) T33-28A (SEQ ID NO: 29); and

(j) T33-28B (SEQ ID NO: 30).

In another embodiment, the isolated proteins comprise or consist of an amino acid sequence selected from the group consisting of:

(a) T32-28A (SEQ ID NO: 31);

(b) T32-28B (SEQ ID NO: 32);

(c) T33-09A (SEQ ID NO: 33);

(d) T33-09B (SEQ ID NO: 34);

(e) T33-15A (SEQ ID NO: 35);

(f) T33-15B (SEQ ID NO: 36);

(g) T33-21A (SEQ ID NO: 37);

(h) T33-21B (SEQ ID NO: 38);

(i) T33-28A (SEQ ID NO: 39); and

(j) T33-28B (SEQ ID NO: 40).

In another embodiment, the isolated proteins comprise or consist of an amino acid sequence:

(A) T32-28A (SEQ ID NO: 11, 21, or 31), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 1;

(B) T32-28B (SEQ ID NO: 12, 22, or 32), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 2;

(C) T33-09A (SEQ ID NO: 13, 23, or 33), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 3;

(D) T33-09B (SEQ ID NO: 14, 24, or 34), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 4;

(E) T33-15A (SEQ ID NO: 15, 25, or 35), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 5;

(F) T33-15B (SEQ ID NO: 16, 26, or 36), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 6;

(G) T33-21A (SEQ ID NO: 17, 27, or 37), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 7;

(H) T33-21B (SEQ ID NO: 18, 28, or 38), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 8;

(I) T33-28A (SEQ ID NO: 19, 29, or 39), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 9; and

(J) T33-28B (SEQ ID NO: 20, 30, or 40), wherein the protein is at least 70% identical to the amino acid sequence of SEQ ID NO: 10.

In various further embodiments, the protein of any one of (A)-(J) is at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identical to the amino acid sequence of the designed protein.
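The percent-identity thresholds above can be illustrated with a short calculation. The sketch below assumes the two sequences have already been aligned to equal length by some external tool and counts identical columns over the full alignment length, with gap characters treated as mismatches. This is only one common convention, the disclosure does not specify which convention applies, and the function name is hypothetical.

```python
# Illustrative sketch only; the identity convention used here is an assumption.
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences.

    Inputs must be pre-aligned strings of equal length; '-' marks a gap and
    never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy ten-residue strings for demonstration, with two substitutions.
    designed = "MEEVVLITVP"
    variant = "MEEVALITVA"
    print(percent_identity(designed, variant))
```

A variant would satisfy, for example, the 75% threshold under this convention if the returned value is at least 75.0.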

In a further embodiment, the isolated protein comprises or consists of an amino acid sequence selected from the group consisting of: SEQ ID NOS: 1-10.

As used throughout the present application, the term “protein” is used in its broadest sense to refer to a sequence of subunit amino acids. The polypeptides of the invention may comprise L-amino acids, D-amino acids (which are resistant to L-amino acid-specific proteases in vivo), or a combination of D- and L-amino acids. The polypeptides described herein may be chemically synthesized or recombinantly expressed. The polypeptides may be linked to any other moiety as deemed useful for a given purpose. Such linkage can be covalent or non-covalent as is understood by those of skill in the art.

In one non-limiting embodiment, the protein can be modified to facilitate covalent linkage to a “cargo” of interest. In one non-limiting example, the protein can be modified, such as by introduction of various cysteine residues at defined positions to facilitate linkage to one or more antigens of interest, such that an assembly of the protein would provide a scaffold to provide a large number of antigens for delivery as a vaccine to generate an improved immune response (similar to the use of virus-like particles). In another non-limiting embodiment, the protein of the invention may be modified by linkage (covalent or non-covalent) with a moiety to help facilitate “endosomal escape.”

In a further aspect, the present invention provides multimers, comprising a plurality of identical protein monomers according to any embodiment or combination of embodiments of the proteins of the invention. As is disclosed herein, proteins of the invention are capable of self-interacting into multimeric substructures (e.g., dimers, trimers, tetramers, pentamers, hexamers, etc.) formed from self-assembly of a plurality of a single protein monomer of the invention (i.e., “homo-multimeric assemblies”). As used herein, a “plurality” means 2 or more. In various embodiments, the multimeric assembly comprises 2, 3, 4, 5, 6, or more identical protein monomers. The multimeric assemblies can be used for any purpose, including but not limited to creating the nanostructures of the present invention.

In another aspect, the present invention provides isolated nucleic acids encoding a protein of the present invention. The isolated nucleic acid sequence may comprise RNA or DNA. As used herein, “isolated nucleic acids” are those that have been removed from their normal surrounding nucleic acid sequences in the genome or in cDNA sequences. Such isolated nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded protein, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the proteins of the invention.

In a further aspect, the present invention provides recombinant expression vectors comprising the isolated nucleic acid of any embodiment or combination of embodiments of the invention operatively linked to a suitable control sequence. “Recombinant expression vector” includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. “Control sequences” operably linked to the nucleic acid sequences of the invention are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered “operably linked” to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors can be of any type known in the art, including but not limited to plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The construction of expression vectors for use in transfecting prokaryotic cells is also well known in the art, and thus can be accomplished via standard techniques. (See, for example, Sambrook, Fritsch, and Maniatis, in: Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989; Gene Transfer and Expression Protocols, pp. 109-128, ed. E. J. Murray, The Humana Press Inc., Clifton, N.J.; and the Ambion 1998 Catalog (Ambion, Austin, Tex.).) The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In a preferred embodiment, the expression vector comprises a plasmid. However, the invention is intended to include other expression vectors that serve equivalent functions, such as viral vectors.

In another aspect, the present invention provides host cells that have been transfected with the recombinant expression vectors disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably transfected. Such transfection of expression vectors into prokaryotic and eukaryotic cells can be accomplished via any technique known in the art, including but not limited to standard bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection. (See, for example, Molecular Cloning: A Laboratory Manual (Sambrook, et al., 1989, Cold Spring Harbor Laboratory Press); Culture of Animal Cells: A Manual of Basic Technique, 2nd Ed. (R. I. Freshney, 1987, Liss, Inc., New York, N.Y.).) A method of producing a polypeptide according to the invention is an additional part of the invention. The method comprises the steps of (a) culturing a host according to this aspect of the invention under conditions conducive to the expression of the polypeptide, and (b) optionally, recovering the expressed polypeptide.

In a further aspect, the present invention provides kits comprising:

(a) one or more of the isolated proteins, multimeric protein assemblies, or nanostructures of the invention;

(b) one or more recombinant nucleic acids of the invention;

(c) one or more recombinant expression vectors comprising recombinant nucleic acids of the invention; and/or

(d) one or more recombinant host cells, comprising recombinant expression vectors of the invention.

Nanostructure and Protein Examples

Two distinct example tetrahedral architectures have been considered in detail: the T33 architecture described above and the T32 architecture shown in FIGS. 2 and 3F, in which the materials are formed from four trimeric and six dimeric building blocks aligned along the three-fold and two-fold tetrahedral symmetry axes. In an experiment, all pairwise combinations of a set of 1,161 dimeric and 200 trimeric protein building blocks of known structure were docked in the T32 and T33 architectures. This resulted in a large set of potential novel nanomaterials: 232,200 and 19,900 docked protein pairs, respectively, with a given pair often yielding several distinct promising docked configurations. Interface sequence design calculations were carried out on the 1,000 highest scoring docked configurations in each architecture, and the designs were evaluated based on the predicted binding energy, shape complementarity, size, and number of buried unsatisfied hydrogen bonding groups (vide supra).
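The pair counts quoted above follow from simple combinatorics, as the sketch below shows. It assumes that each T32 pair combines one dimer with one trimer and that each T33 pair combines two different trimers. That reading reproduces the stated totals but is an inference from the numbers, and the variable names are hypothetical.

```python
# Illustrative arithmetic only; the pairing scheme is inferred from the totals.
from math import comb

N_DIMERS = 1161   # dimeric building blocks of known structure
N_TRIMERS = 200   # trimeric building blocks of known structure

# T32: every dimer paired with every trimer.
t32_pairs = N_DIMERS * N_TRIMERS

# T33: every unordered pair of two distinct trimers.
t33_pairs = comb(N_TRIMERS, 2)

if __name__ == "__main__":
    print(f"T32 docked pairs: {t32_pairs:,}")
    print(f"T33 docked pairs: {t33_pairs:,}")
```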

After filtering on these criteria, 30 T32 and 30 T33 materials were selected for experimental characterization. The 60 designs were derived from 39 distinct trimeric and 19 dimeric proteins, and contained an average of 19 amino acid mutations per pair of subunits compared to the native sequences. The designed interfaces reside mostly on elements of secondary structure, both α-helices and β-strands, with nearby loops often making minor contributions.

Synthetic genes encoding each designed pair of proteins were cloned in tandem in a single expression vector to allow inducible co-expression in E. coli. Polyacrylamide gel electrophoresis (PAGE) under non-denaturing (native) conditions was used to rapidly screen the assembly state of the designed proteins in clarified cell lysates. Several designed protein pairs yielded single bands that migrated more slowly than the wild-type proteins from which they were derived, suggesting assembly to higher-order species. These proteins were subcloned to introduce a hexahistidine tag at the C terminus of one of the two subunits and purified by nickel affinity chromatography and size exclusion chromatography (SEC). Five pairs of designed proteins, one T32 design (T32-28) and four T33 designs (T33-09, T33-15, T33-21, and T33-28), co-purified off of the nickel column and yielded dominant peaks at the expected size of approximately 24 subunits when analyzed by SEC, such as shown in FIG. 5A.

FIG. 5A shows SEC chromatograms of the designed pairs of proteins (solid lines) and the wild-type oligomeric proteins from which they were derived (dashed and dotted lines). The co-expressed designed proteins elute at the volumes expected for the target 24-subunit nanomaterials, while the wild-type proteins elute as dimers or trimers. The T33-15 in vitro panel shows chromatograms for the individually produced and purified designed components (T33-15A and T33-15B [dashed and dotted lines]) as well as a stoichiometric mixture of the two components (solid line).

FIG. 5B shows a native PAGE analysis of in vitro-assembled T32-28 (left panel) and T33-15 (right panel) in cell lysates. In FIG. 5B, lysates of the co-expressed design components (lanes 5-6) contain slowly migrating species (arrows) not present in lysates of the wild-type and individually expressed components (lanes 1-4). Mixing equal volumes (e.v.) of crude lysates containing the individual designed components yields the same assemblies (lane 7), although some unassembled building blocks remain due to unequal levels of expression (particularly for T33-15). When the differences in expression levels are accounted for by mixing adjusted volumes of lysates (a.v.), more efficient assembly is observed (lane 8).

FIGS. 5C-5G respectively show native PAGE analyses of in vitro-assembled T32-28, T33-09, T33-15, T33-21, and T33-28 in cell lysates. In FIGS. 5C-5G, lane 1 is from cells expressing the wild-type scaffold for component A and lane 2 the wild-type scaffold for component B. Lanes 3-4 are from cells expressing the individual design components and lanes 5-6 the co-expressed components. Lanes 7-8 are from samples mixed as crude lysates (cr.e.v. or cr.a.v.), while lanes 9-10 are from samples mixed as cleared lysates (cl.e.v. or cl.a.v.). Lanes 7 and 9 are from lysates mixed with equal volumes (cr.e.v. or cl.e.v.), while lanes 8 and 10 are from lysates mixed with adjusted volumes (cr.a.v. or cl.a.v.). Lane 5 is from cells expressing the C-terminally A1-tagged constructs; all other lanes are from cells expressing the C-terminally His-tagged constructs. An arrow is positioned next to each gel indicating the migration of 24-subunit assemblies, and the gel regions containing unassembled building blocks are bracketed. Each gel was stained with GelCode Blue (Thermo Scientific).

The ability of each material to assemble in vitro was tested by expressing the two components in separate E. coli cultures and mixing them at various points after cell lysis. Native PAGE revealed that in two cases, T33-15 shown in FIG. 5E and T32-28 shown in FIG. 5C, the two separately expressed components efficiently assembled to the designed materials in vitro when equal volumes of cell lysates were mixed as indicated in FIGS. 5B, 5C, and 5E. Adjusting the volume of each lysate in the mixture to account for differences in the level of soluble expression of the two components allowed for more quantitative assembly. In the case of T33-15, the two components of the material could also be purified independently: T33-15A and T33-15B each eluted from the SEC column as trimers in isolation. After mixing the two purified components in a 1:1 molar ratio and incubating for two hours at room temperature, the mixture eluted from the SEC column as predominantly the 24mer assembly, with small amounts of residual trimeric building blocks remaining as shown in FIG. 5A. The assembly of our designed materials can thus be controlled by simply mixing the two components.

FIGS. 6A and 6B show electron micrographs of designed two-component protein nanomaterials. As shown in FIG. 6A, negative stain electron microscopy of the five designed materials confirmed that they assemble specifically to the target architectures (FIG. 2). For each material, fields of remarkably monodisperse particles of the expected size and symmetry were observed, confirming the homogeneity of the materials suggested by SEC. Particle averaging yielded images that recapitulate features of the computational design models at low resolution. For example, class averages of T33-09 revealed roughly square or triangle-shaped structures with well-defined internal cavities that closely resemble projections calculated from the computational design model along its two-fold and three-fold axes as shown in FIG. 2, T33-09 inset.

FIG. 6B shows electron micrographs of in vitro-assembled T33-15 (unpurified) and T33-15A and T33-15B in isolation. Negative stain electron micrographs of independently purified T33-15 components (left and middle panels) and unpurified, in vitro-assembled T33-15 (right panel) are shown to scale (scale bar: 25 nm). Micrographs of T33-15 assembled in vitro as described above were indistinguishable from those of co-expressed T33-15 as shown in FIGS. 6A and 6B, demonstrating that the same material is obtained using both methods.

X-ray crystal structures were solved for four of the designed materials (T32-28, T33-15, T33-21, and T33-28) to resolutions ranging from 2.1 to 4.5 Å. Table 3 provides crystallographic statistics for T32-28, T33-15, and T33-28 data collection and refinement, where statistics in parentheses refer to the highest resolution shell.

TABLE 3

| Statistic | T32-28 (PDB ID 4NWN) | T33-15 (PDB ID 4NWO) | T33-28 (PDB ID 4NWR) |
| --- | --- | --- | --- |
| Wavelength (Å) | 0.9793 | 0.9792 | 0.9793 |
| Resolution range (Å) | 93.93-4.5 (4.66-4.5) | 75.49-2.8 (2.901-2.8) | 94.21-3.5 (3.625-3.5) |
| Space group | P3₁21 | F432 | P2₁ |
| Unit cell [a/b/c (Å)] | 246.01 246.01 290.94 | 213.52 213.52 213.52 | 124.91 189.25 376.83 |
| Unit cell [α/β/γ (°)] | 90 90 120 | 90 90 90 | 90 90.02 90 |
| Total reflections | 436516 (44096) | 146590 (14934) | 808494 (85695) |
| Unique reflections | 59814 (5903) | 10783 (1045) | 217956 (21869) |
| Multiplicity | 7.3 (7.5) | 13.6 (14.3) | 3.7 (3.9) |
| Completeness (%) | 98.31 (97.93) | 99.91 (100.00) | 98.80 (99.57) |
| Mean I/sigma(I) | 13.20 (2.17) | 19.80 (2.16) | 8.95 (2.39) |
| Wilson B-factor | 184 | 79.32 | 90.49 |
| R-merge | 0.1383 (0.9457) | 0.1234 (1.767) | 0.144 (0.6014) |
| R-meas | 0.1492 | 0.1282 | 0.1683 |
| CC1/2 | 0.997 (0.586) | 0.999 (0.718) | 0.994 (0.685) |
| CC* | 0.999 (0.859) | 1 (0.914) | 0.998 (0.902) |
| R-work | 0.2971 (0.3574) | 0.2020 (0.3181) | 0.2614 (0.3126) |
| R-free | 0.3429 (0.3937) | 0.2515 (0.3765) | 0.2987 (0.3639) |
| Number of non-hydrogen atoms | 20307 | 2011 | 88861 |
| macromolecules | 20307 | 2008 | 88861 |
| ligands | 0 | 1 | 0 |
| water | 0 | 2 | 0 |
| Protein residues | 4075 | 285 | 12686 |
| RMS(bonds) | 0.003 | 0.003 | 0.002 |
| RMS(angles) | 0.55 | 0.77 | 0.49 |
| Ramachandran favored (%) | 97 | 98 | 97 |
| Ramachandran outliers (%) | 0.15 | 0 | 0 |
| Clashscore | 0.89 | 2.26 | 4.61 |
| Average B-factor | 216.2 | 72.6 | 91.7 |
| macromolecules | 216.2 | 72.6 | 91.7 |
| ligands | | 111.5 | |
| solvent | | 56.6 | |
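Two rows of Table 3 can be cross-checked from other rows of the same table. The following is a minimal Python sketch, not part of the original processing pipeline: it assumes that multiplicity is simply total reflections divided by unique reflections, and that CC* follows the commonly used relation CC* = sqrt(2·CC1/2 / (1 + CC1/2)). The function and variable names are illustrative only.

```python
import math


def multiplicity(total_reflections: int, unique_reflections: int) -> float:
    """Average number of observations per unique reflection."""
    return total_reflections / unique_reflections


def cc_star(cc_half: float) -> float:
    """Estimate CC* from CC1/2 (assumed relation, see lead-in)."""
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


# Overall reflection counts and highest-shell CC1/2 values copied from Table 3.
table3 = {
    "T32-28": {"total": 436516, "unique": 59814, "cc_half_outer": 0.586},
    "T33-15": {"total": 146590, "unique": 10783, "cc_half_outer": 0.718},
    "T33-28": {"total": 808494, "unique": 217956, "cc_half_outer": 0.685},
}

for name, row in table3.items():
    m = multiplicity(row["total"], row["unique"])
    c = cc_star(row["cc_half_outer"])
    print(f"{name}: multiplicity {m:.1f}, highest-shell CC* {c:.3f}")
```

Small differences in the last digit relative to the tabulated CC* values are expected, because the tabulated CC1/2 values are themselves rounded.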

Table 4 shows crystallographic statistics for T33-21 data collection and refinement, where statistics in parentheses refer to the highest resolution shell.

TABLE 4

| Statistic | T33-21 R32 (PDB ID 4NWP) | T33-21 F4₁32 (PDB ID 4NWQ) |
| --- | --- | --- |
| Wavelength (Å) | 1.0393 | 0.9716 |
| Resolution range (Å) | 93.78-2.1 (2.175-2.1) | 96.23-2.8 (2.9-2.8) |
| Space group | R32 | F4₁32 |
| Unit cell [a/b/c (Å)] | 113.35 113.35 634.88 | 272.18 272.18 272.18 |
| Unit cell [α/β/γ (°)] | 90 90 120 | 90 90 90 |
| Total reflections | 901047 (89024) | 431476 (43290) |
| Unique reflections | 92425 (9127) | 21830 (2129) |
| Multiplicity | 9.7 (9.8) | 19.8 (20.3) |
| Completeness (%) | 99.94 (99.97) | 99.99 (99.95) |
| Mean I/sigma(I) | 14.46 (2.48) | 20.89 (3.14) |
| Wilson B-factor | 37.68 | 69 |
| R-merge | 0.1123 (1.179) | 0.1215 (1.203) |
| R-meas | 0.1187 | 0.1248 |
| CC1/2 | 0.998 (0.749) | 0.999 (0.878) |
| CC* | 1 (0.925) | 1 (0.967) |
| R-work | 0.1879 (0.3925) | 0.1815 (0.3340) |
| R-free | 0.2183 (0.4478) | 0.1958 (0.3804) |
| Number of non-hydrogen atoms | 8248 | 2112 |
| macromolecules | 7882 | 2041 |
| ligands | 141 | 55 |
| water | 225 | 16 |
| Protein residues | 1046 | 269 |
| RMS(bonds) | 0.004 | 0.001 |
| RMS(angles) | 0.67 | 0.41 |
| Ramachandran favored (%) | 100 | 99 |
| Ramachandran outliers (%) | 0 | 0 |
| Clashscore | 1.87 | 1.2 |
| Average B-factor | 42.5 | 73.1 |
| macromolecules | 42.2 | 72.5 |
| ligands | 64.5 | 98.8 |
| solvent | 40.6 | 64 |

In the provided cases, the structures can reveal that the inter-building block interfaces were designed with high accuracy: comparing a pair of chains from each structure to the computationally designed model yields backbone root mean square deviations (RMSD) between 0.5 and 1.2 Å, as indicated on the right side of FIG. 7 and Table 5 below.

TABLE 5

| Design model | Crystal structure | Global RMSD (Å) | 2 chain RMSD (Å) | Contents of asymmetric unit | Structure used for superposition |
| --- | --- | --- | --- | --- | --- |
| T32-28 | 4NWN | 2.586 | 1.246 | One cage (24 subunits) | Asymmetric unit |
| T33-15 | 4NWO | 1.433 | 0.876 | One chain of each component (2 subunits) | One cage generated by 32 components of F432 crystal symmetry |
| T33-21 | 4NWP | 1.962 | 0.924 | 4 chains of each component (8 subunits) | One cage generated from one crystallographic 3-fold |
| T33-21 | 4NWQ | 1.482 | 0.765 | One chain of each component (2 subunits) | One cage generated from crystallographic 2-folds and 3-folds |
| T33-28 | 4NWR | 0.965 | 0.503 | Four complete cages (96 subunits) | One complete cage from the asymmetric unit |
| T33-28 | 4NWR | 0.965 | 0.548 | Four complete cages (96 subunits) | One complete cage from the asymmetric unit |
| T33-28 | 4NWR | 1.195 | 0.567 | Four complete cages (96 subunits) | One complete cage from the asymmetric unit |
| T33-28 | 4NWR | 1.212 | 0.477 | Four complete cages (96 subunits) | One complete cage from the asymmetric unit |

For Table 5, global RMSDs were calculated over all 24 subunits of each design model and corresponding subunits in each crystal structure and 2 chain RMSDs were calculated over chains A and B of each design model and corresponding subunits in each crystal structure. 24 subunits composing one complete cage were derived from each crystal structure as indicated and the chains renamed to match the corresponding names in the design models. In the case of T33-28, four different sets of RMSD calculations were carried out; one for each of the four cages contained in the asymmetric unit of 4NWR.
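A backbone RMSD of the kind reported in Table 5 can be illustrated with a short Python sketch. This is not the software used for the reported values; it assumes that the matched backbone atoms of the design model and the crystal structure have already been extracted into two equally sized (N, 3) coordinate arrays in the same atom order, and it uses the Kabsch algorithm (an assumed choice) for the optimal superposition. The name `kabsch_rmsd` and the synthetic coordinates are hypothetical.

```python
import numpy as np


def kabsch_rmsd(model: np.ndarray, target: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    if model.shape != target.shape or model.ndim != 2 or model.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")
    # Remove the translational component by centering on the centroids.
    p = model - model.mean(axis=0)
    q = target - target.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p @ rotation.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))


# Synthetic stand-in coordinates (not taken from any deposited structure).
rng = np.random.default_rng(0)
design_atoms = rng.normal(size=(100, 3))
crystal_atoms = design_atoms + rng.normal(scale=0.5, size=(100, 3))
print(f"RMSD after superposition: {kabsch_rmsd(design_atoms, crystal_atoms):.3f} Å")
```

Under this sketch, a "global" value would be obtained by passing the atoms of all 24 subunits at once, and a "2 chain" value by passing only the atoms of chains A and B.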

In the structures with resolutions that permit detailed analysis of side chain configurations (T33-15 and two independent crystal forms of T33-21), 87/113 side chains at the designed interfaces can adopt the predicted conformations as indicated in Tables 6 and 7 below. Table 6 shows a side chain chi value comparison of T33-15 crystal structure (PDB ID 4NWO) with the design model. The numbers reported are the differences in the value of each side chain chi value for each amino acid resolved in the crystal structure.

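TABLE 6

| Residue | Δchi1 | Δchi2 | Δchi3 | Δchi4 | Δchi5 |
| --- | --- | --- | --- | --- | --- |
| I9 | 1.6 | 4.2 | | | |
| T10 | 4.8 | | | | |
| V11 | 1.1 | | | | |
| N12 | −0.8 | −3.4 | | | |
| S13 | −8.3 | | | | |
| T15 | −0.5 | | | | |
| P16 | 2.0 | −0.7 | −0.9 | 2.4 | |
| T17 | 135.3 | | | | |
| S18 | 116.3 | | | | |
| I20 | 1.3 | −13.5 | | | |
| I21 | 4.8 | −1.4 | | | |
| I24 | −7.1 | −4.4 | | | |
| L25 | −1.1 | −6.8 | | | |
| E28 | −103.8 | −14.8 | 79.8 | | |
| K29 | −16.6 | — | — | — | |
| E32 | — | — | — | | |
| Q64 | 90.3 | 145.9 | −25.5 | | |
| I65 | −1.8 | 1.7 | | | |
| R86 | −5.2 | −8.2 | −1.3 | −3.6 | 7.5 |
| L108 | 1.6 | −0.6 | | | |
| S109 | 3.9 | | | | |
| E110 | — | — | — | | |
| T140 | — | | | | |
| I143 | 2.2 | −9.0 | | | |
| K146 | 100.0 | 3.9 | — | — | |
| I149 | −98.7 | −11.1 | | | |
| L150 | 0.8 | −9.1 | | | |
| N153 | 13.3 | 4.6 | | | |
| L154 | 3.5 | 3.4 | | | |
| E226 | — | — | — | | |
| S227 | −1.3 | | | | |
| K229 | — | — | — | — | |
| E230 | — | — | — | | |
| V231 | — | | | | |
| D255 | −9.8 | 4.6 | | | |
| P256 | −4.9 | 3.3 | −0.2 | −3.0 | |
| S258 | −10.3 | | | | |
| S260 | 0.5 | | | | |
| D261 | 2.4 | 18.8 | | | |
| L264 | 103.8 | 108.6 | | | |
| N285 | −6.7 | −7.6 | | | |
| M288 | −8.4 | 2.0 | −10.2 | | |
| I289 | −4.5 | −3.3 | | | |
| Pass | 29 | 23 | 4 | 3 | 1 |
| Fail | 7 | 2 | 2 | 0 | 0 |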

In Table 6, residue numbers refer to positions in the T33-15 design model, the “pass” values are the number of residues where |Δchi|≤25°, and the “fail” values are the number of residues where |Δchi|>25°. Residues with missing atoms in the crystal structure, for which a Δchi value could not be determined, are indicated with a dash. All Δchi values are reported in degrees.
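The pass/fail criterion can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script used to produce the table: the sign convention (crystal structure minus design model) and the wrapping of differences into the interval [−180°, 180°) are assumptions, and the helper names are hypothetical. The example values are the Δchi3 entries of Table 6, with `None` standing in for a few of the entries marked with a dash.

```python
from typing import Optional


def wrapped_delta(chi_crystal: float, chi_model: float) -> float:
    """Signed angular difference in degrees, wrapped into [-180, 180)."""
    return (chi_crystal - chi_model + 180.0) % 360.0 - 180.0


def classify(delta_chi: Optional[float], threshold: float = 25.0) -> str:
    """Return 'pass', 'fail', or 'missing' for one delta-chi value (degrees)."""
    if delta_chi is None:
        return "missing"
    return "pass" if abs(delta_chi) <= threshold else "fail"


# Delta-chi3 column of Table 6; None marks atoms missing from the crystal structure.
delta_chi3 = {
    "P16": -0.9, "E28": 79.8, "K29": None, "E32": None,
    "Q64": -25.5, "R86": -1.3, "P256": -0.2, "M288": -10.2,
}

labels = [classify(v) for v in delta_chi3.values()]
print("pass:", labels.count("pass"), "fail:", labels.count("fail"))
```

Wrapping matters when the two angles straddle ±180°: chi values of 170° and −170° differ by 20°, not 340°.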

Table 7 shows side chain chi value comparison of T33-21 crystal structures (PDB IDs 4NWP and 4NWQ) with the design model.

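TABLE 7

| Residue¹ | Δchi1 (vs. 4NWP) | Δchi2 (vs. 4NWP) | Δchi3 (vs. 4NWP) | Δchi4 (vs. 4NWP) | Δchi5 (vs. 4NWP) | Δchi1 (vs. 4NWQ) | Δchi2 (vs. 4NWQ) | Δchi3 (vs. 4NWQ) | Δchi4 (vs. 4NWQ) | Δchi5 (vs. 4NWQ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| K52 | — | — | — | — | | 108.5 | −0.1 | 0.7 | −1.5 | |
| E54 | 1.4 | — | — | | | 4.4 | −7.1 | 2.8 | | |
| S57 | 9.5 | | | | | 6.0 | | | | |
| E58 | 13.8 | — | — | | | 82.2 | 90.5 | −38.0 | | |
| E59 | — | — | — | | | −97.1 | 95.5 | 14.7 | | |
| R60 | −0.2 | 4.1 | −4.2 | 18.9 | −0.2 | 2.1 | 5.4 | −8.1 | −3.2 | 0.0 |
| I61 | 0.7 | −8.0 | | | | −6.4 | −13.6 | | | |
| W63 | 3.0 | −3.3 | | | | 2.3 | 1.6 | | | |
| L65 | −5.1 | 12.7 | | | | −11.2 | −3.6 | | | |
| K66 | 110.0 | −1.8 | 121.6 | 120.9 | | 106.5 | −25.5 | −119.7 | 1.4 | |
| I68 | 2.4 | −10.0 | | | | −1.1 | −12.1 | | | |
| L69 | 0.1 | 3.0 | | | | −3.0 | −1.3 | | | |
| M72 | 33.0 | −0.3 | 2.5 | | | 1.1 | −3.8 | 11.4 | | |
| L102 | −4.7 | 9.6 | | | | −4.7 | 8.0 | | | |
| R103 | −7.4 | −5.5 | −5.4 | −3.4 | 0.3 | −7.8 | −5.2 | −5.3 | −1.5 | 0.7 |
| L106 | −2.0 | −10.6 | | | | −4.8 | −13.7 | | | |
| T109 | −7.4 | | | | | −9.8 | | | | |
| R110 | −102.3 | 14.3 | −4.1 | 39.6 | 0.1 | −106.9 | 19.8 | −1.9 | 17.6 | 0.4 |
| I114 | −6.7 | −20.3 | | | | −4.2 | 4.6 | | | |
| E117 | 0.3 | 168.2 | −55.5 | | | −16.4 | 160.2 | 5.0 | | |
| L123 | −2.2 | 1.4 | | | | −2.2 | 1.8 | | | |
| D127 | −9.4 | 2.5 | | | | −9.8 | 14.0 | | | |
| D145 | — | — | | | | — | — | | | |
| K175 | 3.5 | — | — | — | | −1.8 | −5.0 | 2.6 | −0.6 | |
| D221 | 13.3 | −21.7 | | | | 19.4 | −17.7 | | | |
| I222 | 109.5 | 110.2 | | | | 100.6 | −9.5 | | | |
| T224 | −4.6 | | | | | −4.7 | | | | |
| R225 | −15.7 | −6.2 | 0.4 | −9.4 | 1.0 | −12.3 | −1.4 | −6.0 | 2.4 | 0.2 |
| T226 | −5.1 | | | | | −1.8 | | | | |
| L227 | 4.5 | 0.5 | | | | 5.3 | 0.3 | | | |
| S231 | 111.7 | | | | | −4.9 | | | | |
| C233 | −7.9 | | | | | −2.1 | | | | |
| V235 | −1.6 | | | | | −6.8 | | | | |
| E238 | — | — | — | | | −11.4 | 0.6 | 16.9 | | |
| E258 | 14.0 | −7.2 | 62.5 | | | −1.1 | −13.0 | 84.3 | | |
| R259 | −4.5 | 3.5 | −52.7 | 73.8 | −1.2 | 2.6 | −5.9 | −120.7 | −85.6 | 0.0 |
| L260 | 20.9 | −12.0 | | | | 6.8 | −3.3 | | | |
| S261 | 0.9 | | | | | 4.1 | | | | |
| Y262 | −11.7 | 2.4 | | | | −5.4 | −6.2 | | | |
| K264 | 12.2 | −1.4 | −9.0 | 7.4 | | 6.6 | −1.7 | −129.1 | 7.8 | |
| R265 | 90.0 | −33.2 | 139.5 | −16.2 | 0.1 | 3.3 | 2.0 | 11.5 | −6.7 | 0.1 |
| Pass | 31 | 23 | 6 | 5 | 6 | 34 | 28 | 12 | 9 | 6 |
| Fail | 6 | 3 | 5 | 3 | 0 | 6 | 4 | 5 | 1 | 0 |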

In Table 7, residue numbers refer to positions in the T33-21 design model, the “pass” values are the number of residues where |Δchi|≤25°, and the “fail” values are the number of residues where |Δchi|>25°. Residues with missing atoms in the crystal structure, for which a Δchi value could not be determined, are indicated with a dash. All Δchi values are reported in degrees.

As intended, the designed interfaces can drive assembly of cage-like nanomaterials that closely match the computational design models: the backbone RMSD over all 24 subunits in each material ranges from 1.0 to 2.6 Å. The precise control over interface geometry offered by our method thus enables the design of two-component protein nanomaterials with diverse nanoscale features such as surfaces, pores, and internal volumes with high accuracy.

The method described here can provide a general route to designing multi-component protein-based nanomaterials and molecular machines with programmable structures and functions. The capability to design highly homogeneous protein nanostructures with atomic-level accuracy and controllable assembly can open new opportunities in targeted drug delivery, vaccine design, plasmonics, and other applications that can benefit from the precise patterning of matter on the sub-nanometer to hundred nanometer scale.

EXPERIMENTAL METHODS

Amino Acid Sequences.

Enumerated below are the amino acid sequences for the five successful designs that were characterized in detail in this study (T32-28, T33-09, T33-15, T33-21, and T33-28) along with the wild-type proteins from which these designs were derived (referred to by their Protein Data Bank accession numbers followed by the suffix “-wt”). As described above, each designed material comprises a pair of designed proteins. The two components are referred to here by the name of the designed material followed by the suffix “A” or “B”. The amino acid sequences of the two C-terminal tags used in this study are also presented.

3lzl-wt (dimeric scaffold for T32-28A) (SEQ ID NO: 41)
MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHAL KNNPNGFPEGFWMPYLTIAYELKNTDTGAIKRGTLMPMVADDGPHYGANI AMEKDKKGGFGVGNYELTFYISNPEKQGFGRHVDEETGVGKWFEPFKVDY KFKYTGTPK

3n79-wt (trimeric scaffold for T32-28B) (SEQ ID NO: 42)
MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTICPGKFLLMLGGDI GAIQQAIETGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIV ETWSVAACISAADRAVKGSNVTLVRVHMAFGIGGKCYMVVAGDVSDVNNA VTVASESAGEKGLLVYRSVIPRPHEAMWRQMVEG

1nza-wt (trimeric scaffold for T33-09A) (SEQ ID NO: 43)
MEEVVLITVPSEEVARTIAKALVEERLAACVNIVPGLTSIYRWQGEVVED QELLLLVKTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRE NTG

1ufy-wt (trimeric scaffold for T33-09B and T33-15B) (SEQ ID NO: 44)
MVRGIRGAITVEEDTPEAIHQATRELLLKMLEANGIQSYEELAAVIFTVT EDLTSAFPAEAARQIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQ DRVRHVYLREAVRLRPDLESAQ

3k6a-wt (trimeric scaffold for T33-15A) (SEQ ID NO: 46)
MSKAKIGIVTVSDRASAGIYEDISGKAIIDTLNDYLTSEWEPIYQVIPDE QDVIETTLIKMADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFG ELMRAESLKFVPTAILSRQTAGLRGDSLIVNLPGKPKSIRECLDAVFPAI PYCIDLMEGPYLECNEAVIKPFRPKAK

1wy1-wt (trimeric scaffold for T33-21A) (SEQ ID NO: 47)
MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDE EMKGILEEIQNDIYKIMGEIGSKGKIEGISEERIKWLEGLISRYEEMVNL KSFVLPGGTLESAKLDVCRTIARRAERKVATVLREFGIGKEALVYLNRLS DLLFLLARVIEIEKNKLKEVRS

3e6q-wt (trimeric scaffold for T33-21B) (SEQ ID NO: 48)
MPHLVIEATANLRLETSPGELLEQANAALFASGQFGEADIKSRFVTLEAY RQGTAAVERAYLHACLSILDGRDAATRQALGESLCEVLAGAVAGGGEEGV QVSVEVREMERASYAKRVVARQR

3fuy-wt (trimeric scaffold for T33-28A) (SEQ ID NO: 49)
MESVNTSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMKG VRDAQQSIGDDTEFGFKGPHIRIRCVDIDDKHTYNAMVYVDLIVGTGASE VERETAEELAKEKLRAALQVDIADEHSCVTQFEMKLREELLSSDSFHPDK DEYYKDFL

3fwu-wt (trimeric scaffold for T33-28B) (SEQ ID NO: 50)
MPVIQTFVSTPLDHHKRENLAQVYRAVTRDVLGKPEDLVMMTFHDSTPMH FFGSTDPVACVRVEALGGYGPSEPEKVTSIVTAAITKECGIVADRIFVLY FSPLHCGWNGTNF

T32-28A (SEQ ID NO: 1)
MGEVPIGDPKELNGMEIAAVYLQPIEMEPRGIDLAASLADIHLEADIHAL KNNPNGFPEGFWMPYLTIAYALANADTGAIKTGTLMPMVADDGPHYGANI AMEKDKKGGFGVGTYALTFLISNPEKQGFGRHVDEETGVGKWFEPFVVTY FFKYTGTPK

T32-28B (SEQ ID NO: 2)
MSQAIGILELTSIAKGMELGDAMLKSANVDLLVSKTISPGKFLLMLGGDI GAIQQAIETGTSQAGEMLVDSLVLANIHPSVLPAISGLNSVDKRQAVGIV ETWSVAACISAADLAVKGSNVTLVRVHMAFGIGGKCYMVVAGDVLDVAAA VATASLAAGAKGLLVYASIIPRPHEAMWRQMVEG

T33-09A (SEQ ID NO: 3)
MEEVVLITVPSALVAVKIAHALVEERLAACVNIVPGLTSIYRWQGSVVSD HELLLLVKTTTHAFPKLKERVKALHPYTVPEIVALPIAEGNREYLDWLRE NTG

T33-09B (SEQ ID NO: 4)
MVRGIRGAITVEEDTPAAILAATIELLLKMLEANGIQSYEELAAVIFTVT EDLTSAFPAEAARLIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQ DRVRHVYLNEAVRLRPDLESAQ

T33-15A (SEQ ID NO: 5)
MSKAKIGIVTVSDRASAGITADISGKAIILALNLYLTSEWEPIYQVIPDE QDVIETTLIKMADEQDCCLIVTTGGTGPAKRDVTPEATEAVCDRMMPGFG ELMRAESLKEVPTAILSRQTAGLRGDSLIVNLPGDPASISDCLLAVFPAI PYCIDLMEGPYLECNEAMIKPFRPKAK

T33-15B (SEQ ID NO: 6)
MVRGIRGAITVNSDTPTSIIIATILLLEKMLEANGIQSYEELAAVIFTVT EDLTSAFPAEAARQIGMHRVPLLSAREVPVPGSLPRVIRVLALWNTDTPQ DRVRHVYLSEAVRLRPDLESAQ

T33-21A (SEQ ID NO: 7)
MRITTKVGDKGSTRLFGGEEVWKDSPIIEANGTLDELTSFIGEAKHYVDE EMKGILEEIQNDIYKIMGEIGSKGKIEGISEERIAWLLKLILRYMEMVNL KSFVLPGGTLESAKLDVCRTIARRALRKVLTVTREFGIGAEAAAYLLALS DLLFLLARVIEIEKNKLKEVRS

T33-21B (SEQ ID NO: 8)
MPHLVIEATANLRLETSPGELLEQANKALFASGQFGEADIKSRFVTLEAY RQGTAAVERAYLHACLSILDGRDIATRTLLGASLCAVLAEAVAGGGEEGV QVSVEVREMERLSYAKRVVARQR

T33-28A (SEQ ID NO: 9)
MESVNTSFLSPSLVTIRDFDNGQFAVLRIGRTGFPADKGDIDLCLDKMIG VRAAQIFLGDDTEDGFKGPHIRIRCVDIDDKHTYNAMVYVDLIVGTGASE VERETAEEEAKLALRVALQVDIADEHSCVTQFEMKLREELLSSDSFHPDK DEYYKDFL

T33-28B (SEQ ID NO: 10)
MPVIQTFVSTPLDHHKRLLLAIIYRIVTRVVLGKPEDLVMMTFHDSTPMH FFGSTDPVACVRVEALGGYGPSEPEKVTSIVTAAITAVCGIVADRIFVLY FSPLHCGWNGTNF

A1 tag (for fluorescent labeling and lysate screening) (SEQ ID NO: 45)
LEGGDSLDMLEWSL

hexahistidine tag (for purification) (SEQ ID NO: 51)
LEHHHHHH
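Because each designed sequence has the same length as its wild-type scaffold, the designed substitutions can be listed by a position-by-position comparison. The Python sketch below is illustrative only: the helper name `list_substitutions` is hypothetical, positions are counted from 1 at the first listed residue (an assumption that may not match the numbering of the design models), and only the first 50 residues of 1nza-wt (SEQ ID NO: 43) and T33-09A (SEQ ID NO: 3) are compared.

```python
def list_substitutions(wild_type: str, design: str) -> list[str]:
    """List substitutions as e.g. 'E12A' (wild-type residue, position, design residue)."""
    if len(wild_type) != len(design):
        raise ValueError("sequences must have equal length (no insertions or deletions)")
    return [
        f"{wt}{pos}{ds}"
        for pos, (wt, ds) in enumerate(zip(wild_type, design), start=1)
        if wt != ds
    ]


# First 50 residues of 1nza-wt (SEQ ID NO: 43) and T33-09A (SEQ ID NO: 3).
WT_1NZA = "MEEVVLITVPSEEVARTIAKALVEERLAACVNIVPGLTSIYRWQGEVVED"
T33_09A = "MEEVVLITVPSALVAVKIAHALVEERLAACVNIVPGLTSIYRWQGSVVSD"

print(list_substitutions(WT_1NZA, T33_09A))
```

The same function can be applied to any full-length wild-type/design pair from the listing above.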

Protein Expression, Lysate Screening, and Purification.

Codon-optimized genes encoding the designed and corresponding wild-type proteins were either purchased (Gen9) or constructed from sets of purchased oligonucleotides (Integrated DNA Technologies) by recursive PCR. All genes were cloned using the Gibson assembly method into a variant of the pET29b expression vector (Novagen) that had been digested by NdeI and XhoI restriction endonucleases. The genes encoding the wild-type proteins were each cloned into the vector individually, while the genes encoding the designed proteins were cloned in pairs along with the following intergenic region derived from the pETDuet-1 vector (Novagen):

(SEQ ID NO: 52) 5′TAATGCTTAAGTCGAACAGAAAGTAATCGTATTGTACACGGCCGCATA ATCGAAATTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATT CCCCATCTTAGTATATTAGTTAAGTATAAGAAGGAGATATACAT-3′

The constructs for the designed protein pairs thus possessed the following set of elements from 5′ to 3′: NdeI restriction site, upstream gene, intergenic region, downstream gene, XhoI restriction site. The upstream genes encoded components denoted with the suffix “A” above; the downstream genes encoded the “B” components. This allowed for co-expression of the designed protein pairs in which both the upstream and downstream gene had their own T7 promoter/lac operator and ribosome binding site.

The pET29b variant used for the initial constructs appended the A1 peptide tag (vide supra) to the C terminus of each wild-type gene and to the downstream gene of each designed protein pair for fluorescent labeling via the AcpS system. For purification purposes, vectors encoding C-terminally His-tagged versions of the designed protein pairs, the individual protein components, and the corresponding wild-types were subsequently constructed by subcloning (via Gibson assembly) into the standard pET29b vector between the NdeI and XhoI restriction sites. As with the A1 peptide tag, the hexahistidine tag was only appended to the downstream component in the co-expression constructs.

Expression plasmids were transformed into BL21 (DE3) E. coli cells. Cells were grown in LB medium supplemented with 50 mg L⁻¹ of kanamycin (Sigma) at 37° C. until an OD₆₀₀ of 0.8 was reached. Protein expression was induced by addition of 0.5 mM isopropyl-thio-β-D-galactopyranoside (Sigma) and allowed to proceed for either 5 h at 22° C. or 3 h at 37° C. before cells were harvested by centrifugation.

The designed proteins were screened for assembly by subjecting cleared lysates to native (non-denaturing) PAGE as described previously in the context of at least FIGS. 5A-5G. Single bands for each of the five successful materials were visible when stained with GelCode™ Blue (Thermo Scientific). In these initial screens, all constructs were tested under both the 22° C. and the 37° C. expression conditions. Based on these results, in all subsequent work T32-28, T33-28, and the corresponding wild-type proteins were expressed at 22° C., while T33-09, T33-15, T33-21, and the corresponding wild-type proteins were expressed at 37° C.

For purification, cells were lysed by sonication in 50 mM TRIS pH 8.0, 250 mM NaCl, 1 mM DTT, 20 mM imidazole supplemented with 1 mM phenylmethanesulfonyl fluoride, and the lysates were cleared by centrifugation and filtered through 0.22 μm filters (Millipore). The proteins were purified from the filtered supernatants by nickel affinity chromatography on HisTrap™ HP columns (GE Life Sciences) and eluted using a linear gradient of imidazole (0.02-0.5 M). Fractions containing pure protein(s) of interest were pooled, concentrated using centrifugal filter devices (Sartorius Stedim Biotech), and further purified on a Superdex™ 200 10/300 gel filtration column (GE Life Sciences) using 25 mM TRIS pH 8.0, 150 mM NaCl, 1 mM DTT as running buffer. Gel filtration fractions containing pure protein in the desired assembly state were pooled, concentrated, and stored at room temperature or 4° C. for subsequent use in analytical size exclusion chromatography, in vitro mixing, electron microscopy, and X-ray crystallography.

Analytical Size Exclusion Chromatography.

Analytical SEC was performed on a Superdex™ 200 10/300 gel filtration column (GE Life Sciences) using 25 mM TRIS pH 8.0, 150 mM NaCl, 1 mM DTT as the running buffer. The designed materials were loaded onto the column with each component present at a subunit concentration of 50 μM. Individual designed components and wild-type proteins were loaded at a concentration of 50 μM. The apparent molecular weights of the designed proteins were estimated by comparison to the corresponding wild-type proteins and a set of globular protein standards.
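One common way to estimate an apparent molecular weight from globular standards is a log-linear calibration of molecular weight against elution volume. The Python sketch below illustrates that approach only; the source does not state how the comparison was carried out. The function names are hypothetical, and the standard molecular weights and elution volumes are placeholder values chosen for illustration, not measurements from this work.

```python
import numpy as np


def fit_calibration(elution_ml, mw_kda):
    """Fit log10(MW) = slope * elution_volume + intercept by least squares."""
    slope, intercept = np.polyfit(np.asarray(elution_ml, dtype=float),
                                  np.log10(np.asarray(mw_kda, dtype=float)), 1)
    return float(slope), float(intercept)


def apparent_mw(elution_ml: float, slope: float, intercept: float) -> float:
    """Apparent molecular weight (kDa) for a peak at the given elution volume."""
    return float(10.0 ** (slope * elution_ml + intercept))


# HYPOTHETICAL standards: (elution volume in mL, molecular weight in kDa).
# These numbers are placeholders, not data from the experiments described here.
standard_volumes_ml = [9.0, 12.5, 15.0, 17.0]
standard_mw_kda = [670.0, 158.0, 44.0, 17.0]

slope, intercept = fit_calibration(standard_volumes_ml, standard_mw_kda)
peak_ml = 11.0  # hypothetical elution volume of a sample peak
print(f"apparent MW at {peak_ml} mL: {apparent_mw(peak_ml, slope, intercept):.0f} kDa")
```

A calibration of this kind is only reliable within the volume range spanned by the standards, and it assumes a roughly globular shape, which is why comparison to the corresponding wild-type proteins is a useful additional control.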

In Vitro Mixing.

Individual components of the five successful designs were expressed from pET29b vectors encoding C-terminally His-tagged versions of each component (under the same induction conditions outlined above). Lysates containing corresponding pairs of designed components were mixed either immediately following lysis (crude lysates) or after clearance by centrifugation (cleared lysates). Each was mixed with either a one-to-one volumetric ratio or with adjusted volumetric ratios intended to account for observed differences in expression levels of the two components in each designed pair. After incubating for two hours at room temperature, insoluble material was cleared by centrifugation and the samples were subjected to native PAGE analysis. For comparison, these samples were analyzed together with cleared lysates of unmixed components A and B, and cleared lysates from co-expressed A1-tagged designs, co-expressed His-tagged designs, and corresponding His-tagged wild-types. Bands corresponding to the assembled state were clearly visible in the crude lysate mixtures of T32-28 and T33-15. Corresponding bands for T32-28 and T33-15 were also visible in the cleared lysate mixtures, although noticeably less intense in the case of T32-28. It is also noteworthy that while the A1-tagged co-expression construct of T33-09 yielded a visible band for the assembled material, the His-tagged co-expression construct did not. While the His-tagged construct also provided low yield from purification, it did clearly express and assemble (as shown by size exclusion chromatography and electron microscopy). Thus the concentration of the His-tagged assembly appears to be below the detection limit of our native PAGE analysis.

Based on the results from the mixed lysates experiments, T32-28 and T33-15 were additionally subjected to in vitro mixing experiments from purified components. Each of the C-terminally His-tagged components was purified by nickel affinity and gel filtration chromatography, and the purified components were mixed in a 1:1 molar ratio with each component present at a subunit concentration of 50 μM. Following incubation for two hours at room temperature, the mixtures were subjected to analytical size exclusion chromatography. The purifications and size exclusion chromatography were carried out as described above with the exception that 5% (v/v) glycerol was added to all buffers. While T33-15 assembled efficiently from the independently purified components, T32-28 yielded only a small peak for the assembly product. The purified T32-28A component eluted significantly earlier than 3lzl-wt, indicating that lack of assembly in this case may be due to aggregation of the T32-28A component in the absence of T32-28B.

For T32-28A and 3lzl-wt containing samples, DTT was excluded from all buffers and 1 mM CuSO₄ added to the lysis buffer. This was done in accordance with previous work on the 3lzl-wt protein, which revealed copper binding sites at the dimeric interface and putative copper-dependent dimerization. While T32-28 did yield a native PAGE band and a size exclusion peak corresponding to the 24mer assembly without these modifications to the buffers, the purified assemblies were found to partially dissociate upon dilution (as assessed by size exclusion chromatography). In contrast, lysis and purification with the modified buffers yielded stable assemblies with no detectable disassembly upon dilution.

Negative Stain Electron Microscopy.

2-3 μL of purified T32-28, T33-09, T33-15, T33-21 and T33-28 samples at concentrations ranging from 0.01 mg/mL to 5 mg/mL were applied to negatively glow discharged, carbon coated 200-mesh copper grids (Ted Pella, Inc.), washed with Milli-Q™ water and stained with 0.075% uranyl formate. Grids were visualized for oligomer validation and optimized for data collection. Screening and data collection were performed on a 120 kV Tecnai Spirit™ T12 transmission electron microscope (FEI, Hillsboro, Oreg.). All images were recorded using a Tietz CMOS 4k camera at either 49,000× (T33-21 and T33-28) or 60,000× (T32-28, T33-09 and T33-15) magnification.

Coordinates for 3,910 (T32-28), 29,153 (T33-09), 18,197 (T33-15), 5,478 (T33-21) and 13,715 (T33-28) unique particles were obtained for averaging using either Ximdisp™ or EMAN™. Extracted frames of these particles were used to obtain class averages by refinement in either SPIDER™ or IMAGIC™ using multiple rounds of MSA (multivariate statistical analysis) and MRA (multi-reference alignment). A low-resolution (17-30 Å) volume from the design .pdb files outputted from Rosetta3 was obtained using SPIDER™ and validated using UCSF Chimera. Back-projection images were obtained by calculation using SPIDER™ on the low-resolution volumes and visualized using WEB.

Separated, purified components (T33-15A and T33-15B) were screened as above. T33-15A and T33-15B were then mixed in a 1:1 ratio, and grids were prepared from the mixture after 5 minutes, 1 hour and 2 hours at room temperature and screened as above.

Crystallization of T32-28.

T32-28 was crystallized with hanging drop vapor diffusion at room temperature. Crystals were formed within four days by mixing 1 μL of 11.7 mg mL⁻¹ protein and 1 μL of a 500 μL well solution containing only 1.675 M D,L-malic acid at pH 7.0. The crystals were cryo-protected in 2.0 M lithium sulfate and soaked for 20 seconds. The crystals diffracted to at least 4.5 Å and the asymmetric unit contained 12 molecules of T32-28A and 12 molecules of T32-28B in space group P3₁21.

Crystallization of T33-15.

As described above, crystals of T33-15 were grown within one week by mixing 1 μL of 7.6 mg mL⁻¹ protein and 1 μL of a 500 μL well solution containing 100 mM sodium cacodylate at pH 6.5, 200 mM calcium acetate, and 28% (v/v) PEG 300. Crystals were cryo-protected by successive 30-second soaking in 10 μL solutions of mother liquor with glycerol added at final concentrations of 5%, 10%, 15%, and 20%. The crystals diffracted to at least 2.8 Å and the asymmetric unit contained one molecule each of T33-15A and T33-15B in space group F432.

Crystallization of T33-21 in Space Groups R32 and F4₁32.

T33-21 was crystallized similarly as described above. Crystals grew within three weeks following the mixing of 1 μL of 8.6 mg mL⁻¹ protein and 0.5 μL of a 200 μL well solution containing 100 mM citric acid at pH 5.0 and 800 mM ammonium sulfate. Crystals were cryo-protected with 2.0 M lithium sulfate as described above. The crystals diffracted to at least 2.0 Å and the asymmetric unit contained 4 molecules each of T33-21A and T33-21B in space group R32.

Alternatively, crystals also grew within one week by mixing 0.5 μL of 8.6 mg mL⁻¹ protein and 1 μL of a 200 μL well solution containing 100 mM Bis-Tris at pH 5.5 and 2.12 M ammonium sulfate. Cryo-protection was performed with 2.0 M lithium sulfate as described above. These crystals diffracted to at least 2.6 Å and the asymmetric unit contained one molecule each of T33-21A and T33-21B in space group F4₁32.

Crystallization of T33-28.

T33-28 was crystallized as described above. Crystals grew within three days in hanging drops containing 0.5 μL of 15.8 mg mL⁻¹ protein and 0.5 μL of a 200 μL well solution containing 100 mM sodium citrate tribasic dihydrate at pH 5.6, 200 mM ammonium acetate, and 24% (v/v) (+/−)-2-methyl-2,4-pentanediol. Cryo-protection involved passage of the crystal through drops of paratone-N oil until no more mother liquor appeared present around the crystal. The crystals diffracted to at least 3.5 Å and the asymmetric unit contained 48 molecules each of T33-28A and T33-28B in space group P2₁.

Crystallographic Data Collection and Structure Determination.

Diffraction data sets were collected at the Advanced Photon Source (APS) beamline 24-ID-C equipped with a Pilatus™-6M detector. All data were collected at 100 K. Data were collected for T32-28, T33-15, T33-21 (space group R32), T33-21 (space group F4₁32), and T33-28 at detector distances of 650 mm, 450 mm, 300 mm, 300 mm, and 575 mm; with 0.5°, 0.5°, 0.2°, 0.5°, and 0.5° oscillations; and at wavelengths of 0.9793 Å, 0.9792 Å, 1.0393 Å, 0.9716 Å, and 0.9793 Å, respectively.

Data reduction, integration, and scaling were performed with XDS/XSCALE™. The program PHASER™ was used to determine all crystal structures by molecular replacement (MR). For T33-15 and T33-21 structures, the MR search models were the original PDB scaffolds for each computationally-designed component. The MR search models for the structures of T33-28 and T32-28 were models of the tetrahedral assemblies with and without side-chain atoms beyond β-carbons, respectively.

The X-ray diffraction data collected for T32-28 underwent additional processing in XSCALE™ to visualize anomalous scattering from copper ions anticipated in the T32-28A subunits. The data were scaled with unmerged Friedel mates and the resultant electron density map was used to calculate an anomalous Fourier map with the refined model in PHENIX™. The anomalous peaks in the calculated map were not used to model copper ions in the final structure due to unmodeled, coordinating side chains. All deposited structure factors used for refinement were scaled with merged Friedel mates.

Crystallographic Refinement.

All refinement steps were run using the phenix.refine module of PHENIX™. Molecular replacement solutions were first refined with rigid body refinement, and then underwent individual coordinate refinement in addition to other strategies. Refinement strategies were tested comparing grouped and individual atomic displacement parameter (ADP) refinement, translation libration screw-motion (TLS) group definitions, and simulated annealing. Each refinement protocol was iteratively run while the quality of the model between runs was assessed in COOT™ using the 2mFo-DFc map with unfilled Fobs and the mFo-DFc difference map. Subsequent cycles of alternating refinement and model adjustment in COOT™ were performed to obtain the final refined models.

T32-28, T33-15, T33-21 (space group F4₁32), and T33-28 were refined with individual isotropic ADP parameterization with 1 TLS group per polypeptide chain. T32-28 was refined as a model comprised of glycine, alanine, proline, and all other side chains truncated to the β-carbon due to poor electron density visibility in regions occupied by side chains. T33-15 was refined with reference model restraints assigned to T33-15B from chain A of PDB entry 1UFY. T33-21 (space group R32) was refined with individual isotropic ADP parameterization and 3-8 TLS group definitions per chain determined near residual minimization from the TLSMD server.

Model quality was assessed during and after refinement using geometric validation and MolProbity™ tools as a part of the PHENIX™ suite. Structures of T33-15, T33-21, and T33-28 contain 97-100% of the residues within the most favored regions of the Ramachandran plot. Residues in the disallowed regions of the Ramachandran plot are found in T32-28 at positions where the phi and psi angles of the scaffold protein are also disallowed. T32-28, T33-15, and both T33-21 structures have ERRAT scores of 97.0%, 96.6%, 99.4%, and 98.2%, respectively. ERRAT scores indicate the percentage of residues that fall below the 95% confidence limit for erroneous modeling. The large asymmetric unit of the T33-28 structure was inspected with VERIFY3D due to incompatibility with ERRAT and received a passing result, with greater than 80% of residues scoring greater than or equal to 0.2 in the 3D/1D profile. The coordinates of the final models and the merged structure factors have been deposited in the Protein Data Bank with PDB codes 4NWN, 4NWO, 4NWR, 4NWP, and 4NWQ.

The particulars shown herein are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of various embodiments of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for the fundamental understanding of the invention, the description taken with the drawings and/or examples making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

The above definitions and explanations are meant and intended to be controlling in any future construction unless clearly and unambiguously modified in the following examples or when application of the meaning renders any construction meaningless or essentially meaningless. In cases where the construction of the term would render it meaningless or essentially meaningless, the definition should be taken from Webster's Dictionary, 3rd Edition or a dictionary known to those of skill in the art, such as the Oxford Dictionary of Biochemistry and Molecular Biology (Ed. Anthony Smith, Oxford University Press, Oxford, 2004).

As used herein and unless otherwise indicated, the terms “a” and “an” are taken to mean “one”, “at least one” or “one or more”. Unless otherwise required by context, singular terms used herein shall include pluralities and plural terms shall include the singular.

Unless the context clearly requires otherwise, throughout the description and the claims, the words ‘comprise’, ‘comprising’, and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”. Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words “herein,” “above” and “below” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application.

The above description provides specific details for a thorough understanding of, and enabling description for, embodiments of the disclosure. However, one skilled in the art will understand that the disclosure may be practiced without these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The description of embodiments of the disclosure is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. While specific embodiments of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize.

All of the references cited herein are incorporated by reference. Aspects of the disclosure can be modified, if necessary, to employ the systems, functions and concepts of the above references and application to provide yet further embodiments of the disclosure. These and other changes can be made to the disclosure in light of the detailed description.

Specific elements of any of the foregoing embodiments can be combined or substituted for elements in other embodiments. Furthermore, while advantages associated with certain embodiments of the disclosure have been described in the context of these embodiments, other embodiments may also exhibit such advantages, and not all embodiments need necessarily exhibit such advantages to fall within the scope of the disclosure.

The above detailed description describes various features and functions of the disclosed systems, devices, and methods with reference to the accompanying figures. In the figures, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.

With respect to any or all of the ladder diagrams, scenarios, and flow charts in the figures and as discussed herein, each block and/or communication may represent a processing of information and/or a transmission of information in accordance with example embodiments. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, functions described as blocks, transmissions, communications, requests, responses, and/or messages may be executed out of order from that shown or discussed, including substantially concurrent or in reverse order, depending on the functionality involved. Further, more or fewer blocks and/or functions may be used with any of the ladder diagrams, scenarios, and flow charts discussed herein, and these ladder diagrams, scenarios, and flow charts may be combined with one another, in part or in whole.

A block that represents a processing of information may correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a block that represents a processing of information may correspond to a module, a segment, or a portion of program code (including related data). The program code may include one or more instructions executable by a processor for implementing specific logical functions or actions in the method or technique. The program code and/or related data may be stored on any type of computer readable medium such as a storage device including a disk or hard drive or other storage medium.

The computer readable medium may also include non-transitory computer readable media that store data for short periods of time, such as register memory, processor cache, and random access memory (RAM). The computer readable media may also include non-transitory computer readable media that store program code and/or data for longer periods of time, such as secondary or persistent long-term storage, for example read only memory (ROM), optical or magnetic disks, and compact-disc read only memory (CD-ROM). The computer readable media may also be any other volatile or non-volatile storage systems. A computer readable medium may be considered a computer readable storage medium, for example, or a tangible storage device.

Moreover, a block that represents one or more information transmissions may correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions may be between software modules and/or hardware modules in different physical devices.

Numerous modifications and variations of the present disclosure are possible in light of the above teachings. 

1-25. (canceled)
 26. A method, comprising: generating a plurality of representations of a first protein building block using a computing device; generating a plurality of representations of a second protein building block using the computing device, wherein the first protein building block differs from the second protein building block; generating an arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block according to symmetric operations of a designated mathematical symmetry group using the computing device; computationally determining a docked configuration of the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block by at least generating at least one interface for each protein building block of the arrangement that is suitable for computational protein-protein interface design using the computing device; computationally modifying amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block in the docked configuration to specify a plurality of representations of protein-protein interfaces, wherein the plurality of representations of protein-protein interfaces comprise one or more representations of protein-protein interfaces between the first protein building block and the second protein building block that are energetically favorable to drive self-assembly of the protein building blocks comprising the modified amino acid sequences to the docked configuration using the computing device; and generating an output of the computing device that is based on at least one representation of the group consisting of: a representation of the docked configuration, at least one representation of the plurality of representations of the protein-protein interfaces, and at least one representation of the representations of the first protein building block and the representations of the second protein building block having modified amino acid sequences.
 27. The method of claim 26, where each of the first and second protein building blocks comprises a synthetic polypeptide.
 28. The method of claim 26, where each of the first and second protein building blocks comprises a protein multimer that shares an axis of symmetry with the designated mathematical symmetry group.
 29. The method of claim 26, where the designated mathematical symmetry group conforms to a symmetry selected from tetrahedral point group symmetry, octahedral point group symmetry, and icosahedral point group symmetry.
 30. The method of claim 26, where generating the arrangement of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises computationally aligning symmetry axes of the first protein building block and the second protein building block with at least one axis in the designated mathematical symmetry group.
 31. The method of claim 30, wherein determining a docked configuration of the plurality of the first and second protein building blocks further comprises: sampling rotational degrees of freedom and translational degrees of freedom for each of the first and second protein building blocks.
 32. The method of claim 31, wherein sampling the rotational degrees of freedom and the translational degrees of freedom comprises: selecting a rotational value for a rotational degree of freedom for each of the first and second protein building blocks; selecting a translational value for a translational degree of freedom for each of the first and second protein building blocks; determining a sampled representation of the first protein building block based on the selected rotational value for the first protein building block and the selected translational value for the first protein building block; determining a sampled representation of the second protein building block based on the selected rotational value for the second protein building block and the selected translational value for the second protein building block; and determining a designability measure for the docked configuration using the sampled representation of the first protein building block and the sampled representation of the second protein building block.
 33. The method of claim 32, wherein determining the designability measure of the docked configuration comprises determining a number of beta carbon contacts within a specified distance threshold between the sampled representation of the first protein building block and the sampled representation of the second protein building block in the docked configuration based on the values of the selected rotational and translational degrees of freedom.
 34. The method of claim 26, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises selecting a selected representation of one or more amino acid sequences associated with a representation of at least one protein building block of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block.
 35. The method of claim 34, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises computationally mutating an amino acid sequence of the selected representation of one or more amino acid sequences.
 36. The method of claim 26, wherein computationally modifying the amino acid sequences of the plurality of representations of the first protein building block and the plurality of representations of the second protein building block comprises evaluating an energy of an amino acid mutation using a computational score function. 
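
By way of illustration only, the sampling of rotational and translational degrees of freedom and the beta carbon contact count recited in claims 31 to 33 can be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the disclosure: the function names (rotate_about_axis, place, count_cb_contacts, sample_docking), the exhaustive grid search, and the 10.0 angstrom default cutoff are hypothetical choices made for the example. Each building block is assumed to be represented by a list of beta carbon coordinates and to rotate about, and translate along, a unit-length symmetry axis passing through the origin. Clash checking, which a practical docking procedure would likely also need, is omitted.

```python
import math
from itertools import product


def rotate_about_axis(p, axis, angle_deg):
    """Rotate point p about a unit axis through the origin (Rodrigues' formula)."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    kx, ky, kz = axis
    x, y, z = p
    dot = kx * x + ky * y + kz * z
    cross = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    return tuple(v * c + cr * s + k * dot * (1.0 - c)
                 for v, cr, k in zip(p, cross, axis))


def place(coords, axis, angle_deg, shift):
    """Apply one rotational value and one translational value along the axis."""
    placed = []
    for p in coords:
        r = rotate_about_axis(p, axis, angle_deg)
        placed.append(tuple(ri + shift * ki for ri, ki in zip(r, axis)))
    return placed


def count_cb_contacts(block_a, block_b, cutoff):
    """Count inter-block beta carbon pairs no farther apart than cutoff."""
    cutoff_sq = cutoff * cutoff
    return sum(
        1
        for a in block_a
        for b in block_b
        if sum((ai - bi) ** 2 for ai, bi in zip(a, b)) <= cutoff_sq
    )


def sample_docking(cb_a, axis_a, cb_b, axis_b, angles, shifts, cutoff=10.0):
    """Grid-sample both blocks' degrees of freedom; return the best sample.

    Returns (contacts, angle_a, shift_a, angle_b, shift_b). The contact
    count stands in for the designability measure (hypothetical choice of
    a simple maximum; ties keep the first sample encountered).
    """
    best = None
    for ang_a, sh_a, ang_b, sh_b in product(angles, shifts, angles, shifts):
        sampled_a = place(cb_a, axis_a, ang_a, sh_a)
        sampled_b = place(cb_b, axis_b, ang_b, sh_b)
        contacts = count_cb_contacts(sampled_a, sampled_b, cutoff)
        if best is None or contacts > best[0]:
            best = (contacts, ang_a, sh_a, ang_b, sh_b)
    return best
```

In this sketch the four sampled values (one rotation and one translation per building block) fully determine the sampled representations, so the returned tuple is sufficient to regenerate the highest-scoring docked configuration by calling place again with those values.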